Kaggle Competition Submission: Titanic: Machine Learning from Disaster¶
- Author: Paul Tongyoo
- Contact: Message me on LinkedIn
- Date: May 20, 2025
- Official Competition Page: Titanic: Machine Learning from Disaster
- Latest Submission Score (Accuracy): 0.7727 (Top 39% of 15995 entries)
(Work in progress)
Table of Contents¶
- Project Summary
- Introduction
- Methodology
- Data Understanding
- Data Preparation
- Exploratory Data Analysis
- Target
- Individual Features x Target
- Composite Feature x Target
- Pclass x Sex
- Pclass x Title
- Pclass x Parch
- Pclass x SibSp
- Sex x Parch
- Sex x SibSp
- Pclass x Embarked
- Sex x Embarked
- Pclass x HasCabin
- Sex x HasCabin
- Parch x HasCabin
- SibSp x HasCabin
- Embarked x HasCabin
- Pclass x Cabin_count
- Sex x Cabin_count
- Pclass x Cabin_Location_s
- Sex x Cabin_Location_s
- Pclass x Deck_bin
- Sex x Deck_bin
- Parch x Deck_bin
- SibSp x Deck_bin
- Deck x Cabin_Location_s
- Pclass x Title_bin
- Sex x Title_bin
- Pclass x Age_Group
- Sex x Age_Group
- Pclass x FPP_log_bin
- Sex x FPP_log_bin
- Pclass x Parch_SibSp
- Sex x Parch_SibSp
- HasCabin x Parch_SibSp
- Hi-Cardinality Features
- Feature Priority Based on EDA
- Cross-Fold Distribution Shift Analysis
- Feature Engineering
- Model Development
- Hyperparameter Tuning
- Submission
- References
Project Summary¶
This project tackles the classic Kaggle challenge: predicting passenger survival on the Titanic using machine learning. It serves as a hands-on exercise in feature engineering, model development, and interpretability within a well-known dataset, allowing for deep exploration of structured data analysis and model evaluation techniques.
What I Did¶
- Conducted extensive feature engineering, combining domain knowledge and statistical validation to create globally smoothed target-encoded features and subgroup-specific smoothed features.
- Designed features around Pclass-Sex cohorts using conditional masking, binning, and normalization strategies (e.g., Pclass_Title_normalized, Age_Group, Deck_bin).
- Evaluated features using Chi-Squared tests, Cramér’s V, and cross-fold KL divergence, prioritizing variables with consistent distributions and statistically significant survival associations.
- Built a high-performing XGBClassifier pipeline, using ablation testing, SHAP plots, and hyperparameter tuning (e.g., max_depth, min_child_weight, gamma, reg_alpha) to balance accuracy and generalization. Achieved a mean cross-validation accuracy of 0.8114 on the Kaggle training data set.
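As a concrete illustration of the Cramér's V evaluation mentioned above, here is a minimal sketch; the toy `group`/`target` arrays are illustrative stand-ins for the notebook's actual feature and Survived columns:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V association strength between two categorical series (0 = none, 1 = perfect)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, k) - 1))))

# Toy data: survival strongly tied to group membership
group = pd.Series(['a'] * 50 + ['b'] * 50)
target = pd.Series([1] * 45 + [0] * 5 + [1] * 5 + [0] * 45)
print(round(cramers_v(group, target), 2))  # strong association, close to 1
```

Unlike the raw Chi-Squared statistic, Cramér's V is normalized by sample size, which makes association strength comparable across features with different cardinalities.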
What I Learned¶
- High cross-validation accuracy on the training set doesn't always translate to strong performance on the unseen test set — my 0.8114 CV accuracy dropped to 0.7727 on Kaggle's hidden test set. This highlighted how easily models can overfit to patterns specific to the training distribution, especially when engineered features subtly leak group identity or encode rare patterns that don't generalize.
- Smoothed rate encoding can outperform one-hot encoding when group sizes are sufficiently supported and carefully regularized to avoid leakage.
- Subgroup relevance masking (e.g., using Pclass_Sex) is tricky to enforce in practice; even zeroing out or setting NaNs doesn't fully eliminate feature leakage in tree-based models.
- KL divergence is a powerful diagnostic for assessing train–validation distribution shifts, especially for engineered composite features.
- SHAP plots revealed how certain features (e.g., P3_Female_Embarked_smoothed, Deck_bin) mislead the model when data sparsity or masking wasn't properly handled.
- Small gains in accuracy sometimes come at the cost of generalizability. Monitoring standard deviation in CV scores became as important as mean accuracy.
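As a minimal sketch of the smoothed rate encoding discussed above: the smoothing weight `m` is a hypothetical choice, and in practice the encoding would be fit on training folds only to avoid leakage.

```python
import pandas as pd

def smoothed_target_encode(df, col, target, m=10.0):
    """Blend each group's target mean with the global mean, weighted by group size:
    encoding = (count * group_mean + m * global_mean) / (count + m)
    Small groups shrink toward the global rate, limiting the influence of rare categories."""
    global_mean = df[target].mean()
    stats = df.groupby(col)[target].agg(['mean', 'count'])
    smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
    return df[col].map(smoothed)

toy = pd.DataFrame({'Pclass':   [1, 1, 1, 3, 3, 3, 3, 3],
                    'Survived': [1, 1, 0, 0, 0, 1, 0, 0]})
encoded = smoothed_target_encode(toy, 'Pclass', 'Survived', m=5.0)
print(encoded.tolist())
```

With `m=5.0`, the tiny Pclass=1 group (raw rate 0.67) is pulled well toward the global rate of 0.375, which is the regularization effect described above.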
What's Next¶
- Implement model ensembling to incorporate predictions from a complementary model (e.g., LogisticRegression) to improve generalization to unseen data.
- Continue experimenting with features and model configurations to improve accuracy score.
- Continue refining SHAP-driven debugging workflows to triage false positives/negatives and identify data segments where the model is overconfident or blind.
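A minimal sketch of the soft-voting ensemble idea above, using synthetic data and sklearn's GradientBoostingClassifier as a stand-in for the tuned XGBoost pipeline; the hyperparameters here are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the engineered Titanic feature matrix
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

ensemble = VotingClassifier(
    estimators=[('gbt', GradientBoostingClassifier(max_depth=3, random_state=0)),
                ('lr', LogisticRegression(max_iter=1000))],
    voting='soft')  # average predicted probabilities across the two models

scores = cross_val_score(ensemble, X, y, cv=5, scoring='accuracy')
print(f"mean CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Soft voting averages calibrated probabilities, so a linear model can smooth over regions where the tree ensemble is overconfident.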
Introduction¶
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable,” sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the deaths of 1502 of the 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, Kaggle asked us to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).
Methodology¶
from datetime import datetime
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype, IntervalDtype
from scipy.stats import entropy
from scipy.special import rel_entr
import math
from scipy.stats import chi2_contingency
from IPython.display import display
from itertools import combinations
from collections import defaultdict
import IPython
import seaborn as sns
sns.set_theme()
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
custom_cmap = LinearSegmentedColormap.from_list("survival_cmap", ["tomato", "lightblue"])
import re
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, StandardScaler
from sklearn.inspection import permutation_importance
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.model_selection import learning_curve, validation_curve, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import plot_tree, export_text
from xgboost import XGBClassifier
import xgboost as xgb
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import shap
Data Understanding¶
# Kaggle.com env
#train_df = pd.read_csv('/kaggle/input/titanic/train.csv')
#test_df = pd.read_csv('/kaggle/input/titanic/test.csv')
# Local Env
train_df = pd.read_csv('./input/train.csv')
test_df = pd.read_csv('./input/test.csv')
Data Dictionary¶
| Variable | Definition | Key, Example |
|---|---|---|
| Survived | Survival | 0 = No, 1 = Yes |
| PassengerId | Integer index of passenger | 0,1,2,3,... |
| Name | Name of passenger including title | Braund, Mr. Owen Harris |
| Pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| Sex | Sex | female, male |
| Age | Age in years | 0.15, 2, 15 |
| SibSp | # of siblings / spouses aboard the Titanic | 0,1,2,3,.. |
| Parch | # of parents / children aboard the Titanic | 0,1,2,3,.. |
| Ticket | Ticket number, some with prefixes, shared among groups | SW/PP 751 |
| Fare | Passenger fare, shared among groups | 7.91, 14.4542, 512.3292 |
| Cabin | Cabin number(s) listed on ticket, prefixed with deck letter, shared among groups | A20, "B57 B59 B63 B66" |
| Embarked | Port of Embarkation | C = Cherbourg (France), Q = Queenstown (Ireland), S = Southampton (England) |
Variable Notes¶
pclass: A proxy for socio-economic status (SES)
- 1st = Upper
- 2nd = Middle
- 3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
sibsp: The dataset defines family relations in this way...
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way...
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson
- Some children travelled only with a nanny, therefore parch=0 for them.
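The age conventions above are easy to flag programmatically; this sketch uses hypothetical toy ages rather than the actual column:

```python
import pandas as pd

# Toy ages illustrating the data-dictionary conventions:
# fractional values below 1 for infants, xx.5 for estimated ages
ages = pd.Series([0.42, 2.0, 28.0, 30.5, 45.5, 80.0])

is_infant = ages < 1
is_estimated = (ages >= 1) & (ages % 1 == 0.5)

print(ages[is_infant].tolist())     # [0.42]
print(ages[is_estimated].tolist())  # [30.5, 45.5]
```

Such an `is_estimated` flag could itself be a candidate feature, since estimated ages may correlate with missing documentation for lower-class passengers.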
Descriptive Statistics¶
train_df.describe(include='number')
| PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
Initial inferences:
- PassengerId: column is most likely a sequential integer index from 1 to 891
- Survived: Mean survival rate for this data set is 38.4%
- Pclass: At least 25% of passengers were in 1st or 2nd class (25th percentile = 2); the median passenger was in 3rd class
- Age: Youngest passenger was less than 6 months old, oldest was 80 years old. Median age 28 yrs, 75th percentile 38 yrs. 177 values missing
- SibSp / Parch: At least 50% of passengers traveled alone; 75% of passengers had at most 1 sibling or spouse. Max family size is 8
- Fare: Fares ranged from 0.00 (potentially crew?) to 512.33. Median fare 14.45, 75th percentile 31.00; the highest fares likely correlate with upper class
train_df.describe(include='object')
| Name | Sex | Ticket | Cabin | Embarked | |
|---|---|---|---|---|---|
| count | 891 | 891 | 891 | 204 | 889 |
| unique | 891 | 2 | 681 | 147 | 3 |
| top | Braund, Mr. Owen Harris | male | 347082 | B96 B98 | S |
| freq | 1 | 577 | 7 | 4 | 644 |
Initial Inferences:
- Name: Contains surname and title info, probable tokens for splitting
- Sex: Majority of passengers were male
- Ticket: Contains number; some contain prefixes with potentially useful info. 210 passengers shared the same ticket number, indicating traveling together (may be family and/or household staff)
- Cabin: Contains room number and deck letter prefix. Some cabin strings contain multiple cabin numbers, may also signal family or household staff relationships. 687 values missing
- Embarked: Majority of passengers embarked from Southampton, England (644). 2 values missing
Row Samples¶
train_df.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Data Types¶
train_df.dtypes
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
Missing Values Summary¶
train_df.isnull().sum().loc[lambda x: x > 0]
Age         177
Cabin       687
Embarked      2
dtype: int64
test_df.isnull().sum().loc[lambda x: x > 0]
Age       86
Fare       1
Cabin    327
dtype: int64
Data Preparation¶
Missing Value Imputation¶
Embarked¶
- Only 2 missing values in the data set, will impute based on available data
- Both passengers list the same cabin and ticket number, strongly suggesting they traveled together and embarked from the same port
- Historical records show Martha Evelyn Stone and (Rose) Amelie Icard embarked from Southampton (See 1,2 in References section)
- Amelie was a maid to Mrs. Stone
- That said, to preserve the predictive integrity of the upcoming modeling, we'll rely only on information available at train time.
- Passengers residing on Deck B embarked from either Southampton or Cherbourg, increasing the likelihood these two passengers embarked from one of those two ports.
- Verified that neither Cabin B28 nor Ticket 113572 is shared by any other passenger.
- Verified that no other passengers share the last name Icard or Stone.
- Given 58.8% of 1st class passengers embarked from Southampton, it is more likely these passengers embarked from Southampton as well.
- Missing Embarked values will be imputed with the most common embarkation point of passengers of the same Pclass.
train_df[train_df['Embarked'].isnull()][['Name', 'Pclass', 'Ticket', 'Cabin', 'Age', 'Fare', 'Parch', 'SibSp']]
| Name | Pclass | Ticket | Cabin | Age | Fare | Parch | SibSp | |
|---|---|---|---|---|---|---|---|---|
| 61 | Icard, Miss. Amelie | 1 | 113572 | B28 | 38.0 | 80.0 | 0 | 0 |
| 829 | Stone, Mrs. George Nelson (Martha Evelyn) | 1 | 113572 | B28 | 62.0 | 80.0 | 0 | 0 |
train_df[train_df['Ticket'] == '113572'][['Name', 'Pclass', 'Ticket', 'Cabin', 'Age', 'Fare', 'Parch', 'SibSp']]
| Name | Pclass | Ticket | Cabin | Age | Fare | Parch | SibSp | |
|---|---|---|---|---|---|---|---|---|
| 61 | Icard, Miss. Amelie | 1 | 113572 | B28 | 38.0 | 80.0 | 0 | 0 |
| 829 | Stone, Mrs. George Nelson (Martha Evelyn) | 1 | 113572 | B28 | 62.0 | 80.0 | 0 | 0 |
train_df[train_df['Cabin'] == 'B28'][['Name', 'Pclass', 'Ticket', 'Cabin', 'Age', 'Fare', 'Parch', 'SibSp']]
| Name | Pclass | Ticket | Cabin | Age | Fare | Parch | SibSp | |
|---|---|---|---|---|---|---|---|---|
| 61 | Icard, Miss. Amelie | 1 | 113572 | B28 | 38.0 | 80.0 | 0 | 0 |
| 829 | Stone, Mrs. George Nelson (Martha Evelyn) | 1 | 113572 | B28 | 62.0 | 80.0 | 0 | 0 |
train_df[train_df['Name'].str.contains("Stone")][['Name', 'Pclass', 'Ticket', 'Cabin', 'Age', 'Fare', 'Parch', 'SibSp']]
| Name | Pclass | Ticket | Cabin | Age | Fare | Parch | SibSp | |
|---|---|---|---|---|---|---|---|---|
| 319 | Spedden, Mrs. Frederic Oakley (Margaretta Corn... | 1 | 16966 | E34 | 40.0 | 134.5 | 1 | 1 |
| 829 | Stone, Mrs. George Nelson (Martha Evelyn) | 1 | 113572 | B28 | 62.0 | 80.0 | 0 | 0 |
train_df[train_df['Name'].str.contains("Icard")][['Name', 'Pclass', 'Ticket', 'Cabin', 'Age', 'Fare', 'Parch', 'SibSp']]
| Name | Pclass | Ticket | Cabin | Age | Fare | Parch | SibSp | |
|---|---|---|---|---|---|---|---|---|
| 61 | Icard, Miss. Amelie | 1 | 113572 | B28 | 38.0 | 80.0 | 0 | 0 |
deck_df = train_df[['Cabin', 'Embarked']].copy()
deck_df['Deck'] = deck_df['Cabin'].str[0]
deck_df[deck_df['Deck'] == 'B']['Embarked'].value_counts()
Embarked
S    23
C    22
Name: count, dtype: int64
first_class_df = train_df[train_df['Pclass'] == 1]
embarked_counts = first_class_df['Embarked'].value_counts(dropna=False).sort_index()
embarked_percent = (embarked_counts / embarked_counts.sum()) * 100
summary_df = pd.DataFrame({
'Count': embarked_counts,
'Percentage': embarked_percent.round(2)
})
summary_df.index.name = 'Embarked'
summary_df.reset_index(inplace=True)
print(summary_df)
  Embarked  Count  Percentage
0        C     85       39.35
1        Q      2        0.93
2        S    127       58.80
3      NaN      2        0.93
train_df.groupby(['Embarked', 'Pclass'])['Fare'].median()
Embarked Pclass
C 1 78.2667
2 24.0000
3 7.8958
Q 1 90.0000
2 12.3500
3 7.7500
S 1 52.0000
2 13.5000
3 8.0500
Name: Fare, dtype: float64
train_df['Embarked'].describe() # Southampton was the most common embarkation point, used as the fallback
count     889
unique      3
top         S
freq      644
Name: Embarked, dtype: object
def impute_embarked(df):
"""
Imputes missing "Embarked" values with the most common embarkation point among passengers of the same Pclass.
Sets the value to "S" if no mode can be found (i.e., the unlikely scenario where all Embarked values for the Pclass are missing)
Args:
df (DataFrame): Data set to impute (either training or test data set)
Returns:
Updated DataFrame with imputed Embarked feature
"""
def impute_embarked_with_mode(row, df):
if pd.isna(row['Embarked']):
mode_value = df[df['Pclass'] == row['Pclass']]['Embarked'].mode()
return mode_value[0] if not mode_value.empty else 'S' # Default to 'S' if mode is not found
else:
return row['Embarked']
df['Embarked'] = df.apply(lambda row: impute_embarked_with_mode(row, df), axis=1)
return df
prepared_train_df = impute_embarked(train_df)
prepared_test_df = impute_embarked(test_df)
Cabin¶
- 1st class passengers are missing only 18.5% of cabin numbers, vs. 2nd and 3rd class passengers missing 91% and 97.5% of cabin numbers respectively -- suggests having a cabin number is a socio-economic class indicator that should be captured.
- The known cabin values also carry additional signals that should be captured separately:
    - Some contain more than one cabin designation (e.g. "B57 B59 B63 B66"):
        - Number of cabins becomes an indirect family and wealth signal
        - 1st class passengers had the most multi-cabin designations
    - Each cabin token is prefixed with a single letter, most likely the ship deck where it is located
    - The cabin number will also be extracted as a potential signal, as the number corresponds to the cabin's location on the ship
deck_df = train_df[['Pclass', 'Cabin']].copy()
missing_pct = deck_df.groupby('Pclass')['Cabin'].apply(lambda x: x.isna().mean() * 100).reset_index()
missing_pct.columns = ['Pclass', 'Missing_Cabin_Percentage']
print(missing_pct)
   Pclass  Missing_Cabin_Percentage
0       1                 18.518519
1       2                 91.304348
2       3                 97.556008
def derive_features_from_cabin_then_drop(df):
"""
Creates FOUR new features based on contents of "Cabin" and then DROPS the "Cabin" column:
1) "HasCabin" (bool): true if passenger had Cabin value
2) "Cabin_count" (category): Number of cabins cited in passenger's Cabin value.
Set to 1 for passengers with no Cabin value.
3) "Deck" (category): Single letter identifying the deck where known Cabin was located (e.g. A, B C)
Set to 'M' for passengers with no Cabin value
4) "Cabin_Location_s" (category): String indicating whether cab is located on "port" or "starboard" side of boat
based on cabin number(s). Set to "both" if string contains multiple cabin
numbers that reside on both sides of boat
Args:
df (DataFrame): Data set to impute (either training or test data set)
Returns:
Nothing
"""
df['HasCabin'] = (df['Cabin'].notnull()).astype(int)
df['Cabin_count'] = df['Cabin'].apply(lambda x: 0 if pd.isna(x) else len(x.split()))
df['Cabin_count'] = df['Cabin_count'].astype('category')
# Extract first character from non-missing Cabin values; assign 'M' for missing Cabin values
df['Deck'] = df['Cabin'].apply(lambda x: 'M' if pd.isna(x) else x[0])
deck_order = sorted(df["Deck"].dropna().unique())
df["Deck"] = pd.Categorical(df["Deck"], categories=deck_order, ordered=True)
# Implement Cabin_Location_s
def determine_cabin_side(cabin_str):
if pd.isna(cabin_str):
return "no_cabin_info"
# Extract all numeric parts from the cabin string
cabin_numbers = re.findall(r'\d+', cabin_str)
if not cabin_numbers:
return "no_cabin_number"
cabin_nums = [int(num) for num in cabin_numbers]
all_even = all(num % 2 == 0 for num in cabin_nums)
all_odd = all(num % 2 != 0 for num in cabin_nums)
if all_even:
return "port"
elif all_odd:
return "starboard"
else:
return "port_and_starboard"
# Apply function to create the new feature
df['Cabin_Location_s'] = df['Cabin'].apply(determine_cabin_side).astype('category')
df.drop(columns="Cabin", inplace=True)
# Create prepared training and test data frames to be used for EDA and modeling
derive_features_from_cabin_then_drop(prepared_train_df)
derive_features_from_cabin_then_drop(prepared_test_df)
print(prepared_train_df[['HasCabin']].value_counts().sort_index())
print()
print(prepared_train_df[['Pclass', 'Cabin_count']].value_counts().sort_index())
print()
print(prepared_train_df[['Deck']].value_counts().sort_index())
print()
print(prepared_train_df[['Cabin_Location_s']].value_counts().sort_index())
HasCabin
0 687
1 204
Name: count, dtype: int64
Pclass Cabin_count
1 0 40
1 156
2 12
3 6
4 2
2 0 168
1 16
3 0 479
1 8
2 4
Name: count, dtype: int64
Deck
A 15
B 47
C 59
D 33
E 32
F 13
G 4
M 687
T 1
Name: count, dtype: int64
Cabin_Location_s
no_cabin_info 687
no_cabin_number 4
port 108
port_and_starboard 2
starboard 90
Name: count, dtype: int64
Age¶
- Dropping rows with missing Age is avoided, given the large number of missing values relative to the size of the data set and the availability of surrounding data to inform a calculated imputation strategy.
- Imputing the missing values with the mean/median age (29.7 and 28, respectively) or a missing indicator (e.g. -1) is inferior to measuring the correlation strength between Age and related features, then taking the median age within each correlated group.
- The following correlation strengths were identified (direction ignored, since the label encoding order is arbitrary):
- Pclass (0.42): Moderate correlation
- Title (0.32): Moderate correlation
- Given the above, Age will be imputed with the median age found for each group of passengers split by Pclass and Title.
# Determine correlation strength between Age and adjacent features
le = LabelEncoder()
df_corr = train_df[['Age', 'Sex', 'Pclass', 'Name']].copy()
df_corr['Sex_encoded'] = le.fit_transform(df_corr['Sex'])
df_corr['Title'] = df_corr['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False) + '.'
df_corr['Title_encoded'] = le.fit_transform(df_corr['Title'])
corr = df_corr[['Age', 'Sex_encoded', 'Pclass', 'Title_encoded']].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr.loc[['Age']], annot=True, fmt=".2f", cmap='coolwarm', center=0, square=True)
plt.title("Feature Correlation Heatmap")
plt.show()
plt.figure(figsize=(12, 6))
sns.boxplot(data=df_corr, x="Title", y="Age", hue="Pclass")
plt.title("Median Age by Title differs per Pclass")
plt.show()
def impute_age(df):
"""
Imputes missing values in the "Age" column of the specified DataFrame with the median age
of the data grouped by "Pclass" and Title (extracted from "Name")
Adds "Title" column to the DataFrame as well.
Args:
df (DataFrame): Data set to impute (either training or test data set)
Returns:
Nothing
"""
# First attempt to impute by Pclass X Title
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False) + '.'
pclass_title_age_median_map = df.groupby(['Pclass', 'Title'])['Age'].median()
def impute_age_by_pclass_title(row):
if pd.notna(row['Age']):
return row['Age']
return pclass_title_age_median_map.loc[row['Pclass'], row['Title']]
df['Age'] = df.apply(impute_age_by_pclass_title, axis=1)
# In the scenario where all ages for a given title are missing, impute with median age for Sex
sex_age_median_map = df.groupby(['Sex'])['Age'].median()
def impute_age_by_sex(row):
if pd.notna(row['Age']):
return row['Age']
return sex_age_median_map.loc[row['Sex']]
df['Age'] = df.apply(impute_age_by_sex, axis=1)
impute_age(prepared_train_df)
impute_age(prepared_test_df)
Fare¶
- We'll set missing Fare values to the median Fare of passengers from the same class and embarkation point
sns.boxplot(data=train_df, x="Pclass", y="Fare", hue="Embarked")
plt.show()
def impute_fare(df):
"""
Imputes missing values in the "Fare" column of the specified DataFrame with the median Fare
of passengers with the same Pclass and Embarked
Args:
df (DataFrame): Data set to impute (either training or test data set)
Returns:
Nothing
"""
df['Fare'] = df['Fare'].fillna(df.groupby(['Pclass', 'Embarked'])['Fare'].transform('median'))
impute_fare(prepared_train_df)
impute_fare(prepared_test_df)
# Confirm no more missing values
print(f"Missing Training Data Values:\n{prepared_train_df.isnull().sum().loc[lambda x: x > 0]}")
print(f"\nMissing Test Data Values:\n{prepared_test_df.isnull().sum().loc[lambda x: x > 0]}")
Missing Training Data Values:
Series([], dtype: int64)

Missing Test Data Values:
Series([], dtype: int64)
Exploratory Data Analysis¶
def plot_vars(plot_df):
# Identify discrete and continuous columns
discrete_vars = [col for col in plot_df.columns
if plot_df[col].dtype in ['int64', 'int32', 'object', 'category']
and plot_df[col].nunique() <= 20]
continuous_vars = [col for col in plot_df.columns
if (plot_df[col].dtype in ['float64', 'float32']
or (plot_df[col].dtype in ['int64', 'int32'] and plot_df[col].nunique() > 20))]
# Combine for plotting
all_vars = discrete_vars + continuous_vars
n_cols = 3
n_rows = int(np.ceil(len(all_vars) / n_cols))
fig, axes = plt.subplots(n_rows, n_cols, figsize=(4 * n_cols, 3 * n_rows))
# Flatten axes for easy indexing
axes = axes.flatten()
for i, col in enumerate(all_vars):
if col in discrete_vars:
sns.histplot(plot_df[col], ax=axes[i], discrete=True, shrink=0.8)
else:
sns.histplot(plot_df[col], ax=axes[i], kde=True, bins=30)
axes[i].set_title(col)
axes[i].set_xlabel('')
axes[i].set_ylabel('Count')
# Hide any unused subplots
for j in range(i + 1, len(axes)):
fig.delaxes(axes[j])
plt.tight_layout()
plt.show()
# Plot base dataset
plot_vars(prepared_train_df)
Note: Name and Ticket are not plotted above but will be analyzed below. PassengerId will be ignored given its uniform distribution/definition.
Target¶
- Global survival rate: 38.4%
- Target is imbalanced 61.6% vs 38.4%
- Be sure to stratify target column during cross-validation
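The stratification point above can be verified with a small sketch; the toy labels here mirror the ~38%/62% class split rather than the actual Survived column:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy target mirroring the ~38% positive rate
y = np.array([1] * 38 + [0] * 62)
X = np.arange(len(y)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Each validation fold preserves (approximately) the global positive rate
    print(f"fold {fold}: positive rate = {y[val_idx].mean():.2f}")
```

Without stratification, a small fold could by chance contain very few survivors, making fold-to-fold accuracy estimates noisier.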
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df, x="Survived")
plt.show()
global_survival_rate = prepared_train_df['Survived'].mean()
print(f"Global Survival Rate: {global_survival_rate:.4f}")
Global Survival Rate: 0.3838
Individual Features x Target¶
- The relationship between each feature and the target class Survived is analyzed below.
- The relationship between combinations of features and the target class is analyzed in the subsequent subsection titled "Composite Features x Target".
Pclass¶
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df, x="Pclass", hue="Survived")
plt.show()
survival_df = (
prepared_train_df
.groupby("Pclass")
.agg(Survival_Rate=('Survived', 'mean'), Count=('Survived', 'size'))
.reset_index()
.sort_values(by="Survival_Rate", ascending=False)
)
survival_df
| Pclass | Survival_Rate | Count | |
|---|---|---|---|
| 0 | 1 | 0.629630 | 216 |
| 1 | 2 | 0.472826 | 184 |
| 2 | 3 | 0.242363 | 491 |
- 1st class had the highest survival rate (63%)
- 2nd class had a moderate survival rate (47%)
- 3rd class had the lowest survival rate (24%)
- Non-linear relationship, high sample size per class, clear threshold split points for trees
Sex¶
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df, x="Sex", hue="Survived")
plt.show()
survival_df = (
prepared_train_df
.groupby("Sex")
.agg(Survival_Rate=('Survived', 'mean'), Count=('Survived', 'size'))
.reset_index()
.sort_values(by="Survival_Rate", ascending=False)
)
survival_df
| Sex | Survival_Rate | Count | |
|---|---|---|---|
| 0 | female | 0.742038 | 314 |
| 1 | male | 0.188908 | 577 |
- Most females survived (74.2%), consistent with the "women and children first" evacuation protocol
- The vast majority of males perished (18.9% survival rate)
- Non-linear relationship, high sample size per class, clear threshold split points for trees
SibSp¶
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df, x="SibSp", hue="Survived")
plt.title("All SibSp Values")
plt.show()
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df[prepared_train_df['SibSp'] > 1], x="SibSp", hue="Survived")
plt.title("SibSp > 1")
plt.show()
survival_df = (
prepared_train_df
.groupby("SibSp")
.agg(Survival_Rate=('Survived', 'mean'), Count=('Survived', 'size'))
.reset_index()
.sort_values(by="SibSp", ascending=True)
)
survival_df
| SibSp | Survival_Rate | Count | |
|---|---|---|---|
| 0 | 0 | 0.345395 | 608 |
| 1 | 1 | 0.535885 | 209 |
| 2 | 2 | 0.464286 | 28 |
| 3 | 3 | 0.250000 | 16 |
| 4 | 4 | 0.166667 | 18 |
| 5 | 5 | 0.000000 | 5 |
| 6 | 8 | 0.000000 | 7 |
- Passengers with no siblings or spouses had a lower survival rate (34.5%) than those with 1 or 2 siblings/spouses (53.6% and 46.4%, respectively)
- Survival rate decreased as the number of siblings/spouses increased from 3 to 4 (25%, 16.7%)
- No passengers with 5 or more siblings/spouses survived
- Non-linear relationship: survival rate increases from 0->1 and then decreases from 1->8
- Low sample sizes for SibSp >= 2
- The SibSp=5 and SibSp=8 samples each presumably came from a single family => potentially rare cases, risks overfitting
Parch¶
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df, x="Parch", hue="Survived")
plt.show()
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df[prepared_train_df['Parch'] > 2], x="Parch", hue="Survived")
plt.show()
survival_df = (
prepared_train_df
.groupby("Parch")
.agg(Survival_Rate=('Survived', 'mean'), Count=('Survived', 'size'))
.reset_index()
.sort_values(by="Parch", ascending=True)
)
survival_df
| Parch | Survival_Rate | Count | |
|---|---|---|---|
| 0 | 0 | 0.343658 | 678 |
| 1 | 1 | 0.550847 | 118 |
| 2 | 2 | 0.500000 | 80 |
| 3 | 3 | 0.600000 | 5 |
| 4 | 4 | 0.000000 | 4 |
| 5 | 5 | 0.200000 | 5 |
| 6 | 6 | 0.000000 | 1 |
- Passengers with no parents or children had a lower survival rate (34.4%) than those with 1-3 parents/children
- Survival rate stayed near or above 50% for Parch 1-3 (peaking at 60% for Parch=3), then dropped sharply for 4-6
- Non-linear relationship: survival rate increases from 0->1 and then generally decreases beyond 2
- Low sample sizes for Parch >= 3
Embarked¶
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df, x="Embarked", hue="Survived")
plt.show()
survival_df = (
prepared_train_df
.groupby("Embarked")
.agg(Survival_Rate=('Survived', 'mean'), Count=('Survived', 'size'))
.reset_index()
.sort_values(by="Survival_Rate", ascending=False)
)
survival_df
| Embarked | Survival_Rate | Count | |
|---|---|---|---|
| 0 | C | 0.553571 | 168 |
| 1 | Q | 0.389610 | 77 |
| 2 | S | 0.339009 | 646 |
- Most passengers from Cherbourg survived (55.4%)
- Most passengers embarked from Southampton, which had the lowest survival rate (33.9%)
- Most passengers from Queenstown perished; it had the middle survival rate (39.0%)
- Clear survival threshold point by splitting on Embarked == 'C'
- High sample sizes for the C and S classes, moderate sample size for Q
HasCabin¶
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df, x="HasCabin", hue="Survived")
plt.xticks([0, 1], ["0", "1"])
plt.show()
survival_df = (
prepared_train_df
.groupby("HasCabin")
.agg(Survival_Rate=('Survived', 'mean'),
Count=('Survived', 'size'),
Pclass_1_Count=('Pclass', lambda x: (x == 1).sum()),
Pclass_2_Count=('Pclass', lambda x: (x == 2).sum()),
Pclass_3_Count=('Pclass', lambda x: (x == 3).sum())
)
.reset_index()
.sort_values(by="Survival_Rate", ascending=False)
)
survival_df
| HasCabin | Survival_Rate | Count | Pclass_1_Count | Pclass_2_Count | Pclass_3_Count | |
|---|---|---|---|---|---|---|
| 1 | 1 | 0.666667 | 204 | 176 | 16 | 12 |
| 0 | 0 | 0.299854 | 687 | 40 | 168 | 479 |
- Clear survival threshold split between those with a cabin number (67%) and those without one (30%)
- Majority of passengers with a cabin number were 1st class, linking having a cabin number to socio-economic status and a higher survival rate
- Not having a cabin number is correspondingly associated with lower class and a lower survival rate
Cabin_count¶
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df, x="Cabin_count", hue="Survived")
plt.title("All Cabin_count Values")
plt.show()
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df[prepared_train_df['Cabin_count'] != 0], x="Cabin_count", hue="Survived")
plt.title("Cabin_count > 0")
plt.show()
survival_df = (
prepared_train_df
.groupby("Cabin_count", observed=True)
.agg(Survival_Rate=('Survived', 'mean'),
Count=('Survived', 'size'),
Pclass_1_Count=('Pclass', lambda x: (x == 1).sum()),
Pclass_2_Count=('Pclass', lambda x: (x == 2).sum()),
Pclass_3_Count=('Pclass', lambda x: (x == 3).sum())
)
.reset_index()
.sort_values(by="Cabin_count", ascending=True)
)
survival_df
| Cabin_count | Survival_Rate | Count | Pclass_1_Count | Pclass_2_Count | Pclass_3_Count | |
|---|---|---|---|---|---|---|
| 0 | 0 | 0.299854 | 687 | 40 | 168 | 479 |
| 1 | 1 | 0.677778 | 180 | 156 | 16 | 8 |
| 2 | 2 | 0.562500 | 16 | 12 | 0 | 4 |
| 3 | 3 | 0.500000 | 6 | 6 | 0 | 0 |
| 4 | 4 | 1.000000 | 2 | 2 | 0 | 0 |
- Those listing more than one cabin on their ticket had higher survival rates than those listing only one
- Those listing more than one cabin were mainly 1st class passengers
- Signal behavior is similar to Parch/SibSp (i.e., a "family size" indicator)
- Low sample sizes for Cabin_count >= 2
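Cabin_count is also derived upstream of this excerpt; a plausible sketch (assuming multi-cabin entries are space-separated, e.g. "C23 C25 C27", and missing cabins count as 0) is:

```python
import pandas as pd

# Sketch (assumed): Cabin_count = number of cabin tokens listed on the
# ticket, 0 when the Cabin field is missing.
df = pd.DataFrame({"Cabin": ["C23 C25 C27", "E46", None]})
df["Cabin_count"] = df["Cabin"].fillna("").str.split().str.len()
print(df["Cabin_count"].tolist())  # [3, 1, 0]
```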
Cabin_Location_s¶
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df, x="Cabin_Location_s", hue="Survived")
plt.show()
survival_df = (
prepared_train_df
.groupby("Cabin_Location_s", observed=True)
.agg(Survival_Rate=('Survived', 'mean'),
Count=('Survived', 'size'),
Pclass_1_Count=('Pclass', lambda x: (x == 1).sum()),
Pclass_2_Count=('Pclass', lambda x: (x == 2).sum()),
Pclass_3_Count=('Pclass', lambda x: (x == 3).sum())
)
.reset_index()
.sort_values(by="Count", ascending=False)
)
survival_df
| Cabin_Location_s | Survival_Rate | Count | Pclass_1_Count | Pclass_2_Count | Pclass_3_Count |
|---|---|---|---|---|---|
| no_cabin_info | 0.299854 | 687 | 40 | 168 | 479 |
| port | 0.611111 | 108 | 96 | 6 | 6 |
| starboard | 0.733333 | 90 | 77 | 7 | 6 |
| no_cabin_number | 0.500000 | 4 | 1 | 3 | 0 |
| port_and_starboard | 1.000000 | 2 | 2 | 0 | 0 |
- Starboard-side cabins had a higher survival rate (73.3%), consistent with historical accounts of Officer Murdoch following the protocol of "women and children first, and then men if space remained".
- Port-side cabins had a lower survival rate (61%), consistent with historical accounts of Officer Lightoller allowing only women (even at the expense of leaving boat seats empty!) and children and declining most men.
- Missing cabin info had a significantly lower survival rate (30%) and was mostly comprised of 2nd and 3rd class passengers, suggesting missing cabin info correlated with passenger status
- High sample sizes for no_cabin_info, port, and starboard; very low sample size for the port_and_starboard class
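The Cabin_Location_s labels above suggest a side-of-ship mapping; one plausible sketch (assuming, per the Titanic's deck plans, that odd cabin numbers were starboard and even numbers port) is:

```python
import re
import pandas as pd

# Sketch of an assumed Cabin_Location_s derivation: odd cabin numbers ->
# starboard, even -> port; tickets listing cabins on both sides get their
# own label, and deck-only entries (no number) are flagged separately.
def cabin_location(cabin):
    if pd.isna(cabin):
        return "no_cabin_info"
    numbers = [int(n) for n in re.findall(r"\d+", cabin)]
    if not numbers:
        return "no_cabin_number"   # e.g. a bare deck letter like "D"
    sides = {"starboard" if n % 2 else "port" for n in numbers}
    return "port_and_starboard" if len(sides) == 2 else sides.pop()

print(cabin_location("C85"))      # starboard (85 is odd)
print(cabin_location("B96 B98"))  # port (both even)
print(cabin_location(None))       # no_cabin_info
```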
Deck¶
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df, x="Deck", hue="Survived")
plt.title("All Deck Values")
plt.show()
sns.countplot(data=prepared_train_df[prepared_train_df['Deck'] != 'M'], x="Deck", hue="Survived")
plt.title("Deck != M")
plt.show()
survival_df = (
prepared_train_df
.groupby("Deck", observed=True)
.agg(
Survival_Rate=('Survived', 'mean'),
Count=('Survived', 'size'),
Pclass_1_Count=('Pclass', lambda x: (x == 1).sum()),
Pclass_2_Count=('Pclass', lambda x: (x == 2).sum()),
Pclass_3_Count=('Pclass', lambda x: (x == 3).sum())
)
.reset_index()
.sort_values(by="Deck", ascending=True)
)
survival_df
| Deck | Survival_Rate | Count | Pclass_1_Count | Pclass_2_Count | Pclass_3_Count |
|---|---|---|---|---|---|
| A | 0.466667 | 15 | 15 | 0 | 0 |
| B | 0.744681 | 47 | 47 | 0 | 0 |
| C | 0.593220 | 59 | 59 | 0 | 0 |
| D | 0.757576 | 33 | 29 | 4 | 0 |
| E | 0.750000 | 32 | 25 | 4 | 3 |
| F | 0.615385 | 13 | 0 | 8 | 5 |
| G | 0.500000 | 4 | 0 | 0 | 4 |
| M | 0.299854 | 687 | 40 | 168 | 479 |
| T | 0.000000 | 1 | 1 | 0 | 0 |
- Survival rates varied by deck, with some decks clustering at similar levels
- Decks A, B, and C were exclusive to 1st class
- Decks B, D, and E have similarly high survival rates (~75%)
- Decks F and G held only 2nd and 3rd class passengers, each with low sample sizes
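The Deck feature is likewise assumed to come from the Cabin string; a minimal sketch (first letter of Cabin, with 'M' standing in for "Missing") would be:

```python
import pandas as pd

# Sketch (assumed): Deck = leading letter of the Cabin entry; passengers
# with no recorded cabin fall into a sentinel 'M' ("Missing") deck.
df = pd.DataFrame({"Cabin": ["C85", "B96 B98", None]})
df["Deck"] = df["Cabin"].str[0].fillna("M")
print(df["Deck"].tolist())  # ['C', 'B', 'M']
```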
Title¶
plt.figure(figsize=(5,3))
sns.countplot(data=prepared_train_df.query("Title in ['Mr.', 'Mrs.', 'Miss.', 'Master.']"), x="Title", hue="Survived")
plt.title("Titles in (Mr., Mrs., Miss., Master.)")
plt.show()
plt.figure(figsize=(12,3))
sns.countplot(data=prepared_train_df.query("Title not in ['Mr.', 'Mrs.', 'Miss.', 'Master.']"), x="Title", hue="Survived")
plt.title("All other Titles")
plt.show()
survival_by_title_df = (
prepared_train_df
.groupby("Title")
.agg(Survival_Rate=('Survived', 'mean'),
Count=('Survived', 'size'),
Pclass_1_Count=('Pclass', lambda x: (x == 1).sum()),
Pclass_2_Count=('Pclass', lambda x: (x == 2).sum()),
Pclass_3_Count=('Pclass', lambda x: (x == 3).sum()))
.reset_index()
.sort_values(by="Count", ascending=False)
)
print(survival_by_title_df)
# Confirmation of age range for Master. title
print("\nConfirmation that 'Master.' Title ages reflect 'young boy': 0.42 - 12")
print(prepared_train_df.query("Title in ['Master.']")['Age'].describe())
# Confirmation Ms. title should be grouped with Miss. (unmarried) (assuming she would have traveled with spouse)
print("\nConfirmation Ms. title should be grouped with Miss. (unmarried)")
print(prepared_train_df.query("Title in ['Ms.']")[['Name', 'SibSp']])
         Title  Survival_Rate  Count  Pclass_1_Count  Pclass_2_Count  Pclass_3_Count
12         Mr.       0.156673    517             107              91             319
9        Miss.       0.697802    182              46              34             102
13        Mrs.       0.792000    125              42              41              42
8      Master.       0.575000     40               3               9              28
4          Dr.       0.428571      7               5               2               0
15        Rev.       0.000000      6               0               6               0
7       Major.       0.500000      2               2               0               0
1        Col.        0.500000      2               2               0               0
10      Mlle.        1.000000      2               2               0               0
11       Mme.        1.000000      1               1               0               0
14        Ms.        1.000000      1               0               1               0
0       Capt.        0.000000      1               1               0               0
6       Lady.        1.000000      1               1               0               0
5    Jonkheer.       0.000000      1               1               0               0
3        Don.        0.000000      1               1               0               0
2    Countess.       1.000000      1               1               0               0
16       Sir.        1.000000      1               1               0               0
Confirmation that 'Master.' Title ages reflect 'young boy': 0.42 - 12
count 40.000000
mean 4.516750
std 3.433651
min 0.420000
25% 1.750000
50% 4.000000
75% 7.250000
max 12.000000
Name: Age, dtype: float64
Confirmation Ms. title should be grouped with Miss. (unmarried)
Name SibSp
443 Reynaldo, Ms. Encarnacion 0
- Top 3 Titles (Mr., Miss., Mrs.) reasonably spread across passenger classes, with high sample sizes
- "Master" title relatively low sample size, but doesn't warrant binning with surrounding Titles given its implication of age
Age¶
sns.displot(data=prepared_train_df, x="Age", hue="Survived", kde=True, bins=range(0, 85, 5))
plt.show()
prepared_train_df['Age_bin'] = pd.cut(prepared_train_df['Age'], bins=range(0, 85, 5))
survival_by_Age_bin_df = (
prepared_train_df
.groupby("Age_bin", observed=True)
.agg(
Survival_Rate=('Survived', 'mean'),
Count=('Survived', 'size')
)
.reset_index()
.sort_values(by="Age_bin")
)
print(survival_by_Age_bin_df)
     Age_bin  Survival_Rate  Count
0     (0, 5]       0.687500     48
1    (5, 10]       0.350000     20
2   (10, 15]       0.578947     19
3   (15, 20]       0.403101    129
4   (20, 25]       0.354839    124
5   (25, 30]       0.251256    199
6   (30, 35]       0.462264    106
7   (35, 40]       0.379310     87
8   (40, 45]       0.454545     55
9   (45, 50]       0.400000     40
10  (50, 55]       0.416667     24
11  (55, 60]       0.388889     18
12  (60, 65]       0.285714     14
13  (65, 70]       0.000000      3
14  (70, 75]       0.000000      4
15  (75, 80]       1.000000      1
- First Age bin (0,5] has the highest survival rate (68.7%), consistent with the "women and children first" evacuation policy
- Survival rate decreases across Age bins (15,20] through (25,30], dropping to the lowest bin rate of 25.1%
- Low sample sizes for ages 40+; all have similar survival rates, with the exception of the n=1 (75,80] passenger
Age_Group¶
- Created domain-specific binned Age to assess survival relationship and capture the "Young Child" difference in survival rate found above
- Young Child survival rate greatest of all groups; consistent with "women and children first" evac protocol
- Survival rate amongst remaining groups does not appear significant, will investigate when exploring Composite Features x Target
def create_feature_Age_Group(train_df, test_df):
"""
Add an ordered categorical 'Age_Group' feature to train and test DataFrames using revised domain-specific bins
designed to reduce KL divergence to ≤ 0.02.
Age bins:
- Young_Child: 0-5
- Child: 6–17
- Young_Adult: 18–29
- Adult: 30–59
- Senior: 60+
Args:
train_df (pd.DataFrame): Training dataset containing an 'Age' column.
test_df (pd.DataFrame): Test dataset containing an 'Age' column.
Modifies:
Adds an 'Age_Group' column with ordered categorical values to both datasets.
"""
bins = [0, 5, 17, 29, 59, np.inf]
labels = ['Young_Child', 'Child', 'Young_Adult', 'Adult', 'Senior']
age_cat_type = pd.CategoricalDtype(categories=labels, ordered=True)
for df in [train_df, test_df]:
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels)
df['Age_Group'] = df['Age_Group'].astype(age_cat_type)
create_feature_Age_Group(prepared_train_df, prepared_test_df)
sns.countplot(data=prepared_train_df, x="Age_Group", hue="Survived")
plt.show()
survival_by_Age_Group_bin_df = (
prepared_train_df
.groupby("Age_Group", observed=True)
.agg(
Survival_Rate=('Survived', 'mean'),
Count=('Survived', 'size')
)
.reset_index()
.sort_values(by="Age_Group")
)
print(survival_by_Age_Group_bin_df)
     Age_Group  Survival_Rate  Count
0  Young_Child       0.687500     48
1        Child       0.434783     69
2  Young_Adult       0.310606    396
3        Adult       0.423295    352
4       Senior       0.269231     26
Fare¶
print("Key Detail: Fare is shared amongst passengers sharing same ticket number!")
print("Sample subset of duplicate Ticket/Fare combinations in the data:")
fare_duplicates = prepared_train_df.groupby("Ticket")[['Ticket', "Fare"]].value_counts().head(10).reset_index()
fare_duplicates.columns = ["Ticket Number", "Fare", "Number of Duplicates"]
print(fare_duplicates)
sns.displot(data=prepared_train_df, x="Fare", hue="Survived", kde=True)
plt.title("Fare (raw) Distribution")
plt.show()
prepared_train_df["Fare_log"] = np.log1p(prepared_train_df["Fare"])
prepared_test_df["Fare_log"] = np.log1p(prepared_test_df["Fare"])
sns.displot(data=prepared_train_df, x="Fare_log", hue="Survived", kde=True)
plt.title("Fare (log) Distribution")
plt.show()
prepared_train_df['Fare_log_bin'] = pd.cut(prepared_train_df['Fare_log'], bins=np.arange(0, 8, 0.5))
prepared_test_df['Fare_log_bin'] = pd.cut(prepared_test_df['Fare_log'], bins=np.arange(0, 8, 0.5))
survival_by_Fare_log_bin_df = (
prepared_train_df
.groupby("Fare_log_bin", observed=True)
.agg(
Survival_Rate=('Survived', 'mean'),
Count=('Survived', 'size')
)
.reset_index()
.sort_values(by="Fare_log_bin")
)
print(survival_by_Fare_log_bin_df)
Key Detail: Fare is shared amongst passengers sharing same ticket number!
Sample subset of duplicate Ticket/Fare combinations in the data:
  Ticket Number     Fare  Number of Duplicates
0        110152  86.5000                     3
1        110413  79.6500                     3
2        110465  52.0000                     2
3        110564  26.5500                     1
4        110813  75.2500                     1
5        111240  33.5000                     1
6        111320  38.5000                     1
7        111361  57.9792                     2
8        111369  30.0000                     1
9        111426  26.5500                     1
  Fare_log_bin  Survival_Rate  Count
0   (1.5, 2.0]       0.000000      3
1   (2.0, 2.5]       0.223496    349
2   (2.5, 3.0]       0.414286    140
3   (3.0, 3.5]       0.456647    173
4   (3.5, 4.0]       0.400000     70
5   (4.0, 4.5]       0.641026     78
6   (4.5, 5.0]       0.823529     34
7   (5.0, 5.5]       0.666667     18
8   (5.5, 6.0]       0.625000      8
9   (6.0, 6.5]       1.000000      3
- Important: Fare reflects the amount paid for all passengers sharing the same ticket (see "Ticket" EDA for more info on sharing). Here, fare refers to this "aggregated fare".
- Aggregated Fare is heavily right-skewed; a log transformation was performed to create and plot the new Fare_log feature for analysis
- Low sample sizes observed for the (1.5, 2.0] bin and bins >= (4.5, 5.0]
- Fares also overlap across multiple classes
- Feature Engineering Plan:
- First, calculate a "Fare per person" to account for scenarios where the listed fare is the total paid amongst passengers sharing the same ticket number.
- Bin the low-sample "Fare per person" ranges to increase generalizability
- Combine with Pclass if needed to capture more context around Fare bins and prevent generalization errors from Fare-only patterns
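The fare-per-person idea can be sketched in-sample (the notebook's actual feature, built later during Feature Engineering, uses out-of-fold ticket counts to avoid leakage; this sketch ignores that for clarity):

```python
import pandas as pd

# Simplified sketch of "Fare per person": divide the listed (group-total)
# Fare by the number of passengers sharing the same ticket.
df = pd.DataFrame({"Ticket": ["110152", "110152", "CA2144"],
                   "Fare":   [86.5, 86.5, 46.9]})
ticket_counts = df["Ticket"].map(df["Ticket"].value_counts())
df["Fare_Per_Person"] = df["Fare"] / ticket_counts
print(df["Fare_Per_Person"].tolist())  # [43.25, 43.25, 46.9]
```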
Summary of Single Feature Relationship with Target¶
| Feature | Summary of Relationship with Target | Domain Implication / Further EDA |
|---|---|---|
| Pclass | P1: Most likely to survive, 63%; P2: Near even, 47%; P3: Less likely to survive, 24% | Priority given to higher class, more affluent passengers |
| Sex | male: Less likely, 19%; female: More likely, 74% | Priority given to women, consistent with "women and children first" evac |
| SibSp | 0: Less likely, 34.5%; 1: More likely, 53.6%; 2: Near even, 46%; 3+: declines from 16% toward 0% | Larger families potentially more difficult to evacuate from cabins, or larger families tended to hold lower ticket classes less likely to survive |
| Parch | Similar to SibSp for 0->2 & 5->6; 3: higher survival rate (though small sample size; could be all one family, a rare case) | See SibSp notes |
| Embarked | C: Marginally higher survival, 55%; Q: 39%; S: 34% | C may have the most high-class ticket holders; Q/S predominantly lower class (check in next section) |
| HasCabin | Having a cabin number: 67%; not having one: 30% | Having a cabin number may be associated with higher class (check in next section) |
| Cabin_count | Those with more than one cabin listed had survival rates of 50% and higher | May represent elite families, possibly all one family |
| Cabin_Location_s | Those with cabin numbers on the starboard side had 12% higher survival | Consistent with historical reports of starboard evac procedure being more lenient than port-side evac |
| Deck | Of those with cabin numbers, Decks B, D, and E had the highest survival rates (~75%) | These decks may have been closer to lifeboats and/or held a disproportionate number of female and/or P1 ticket holders |
| Title | "Mr." had the lowest survival, 15.6%; Miss./Mrs. the highest, 69.8%/79.2% respectively; Master. had 57.5%, marginally higher survival | Title implies both Sex and Age; high female-title rates align with "women and children first", though the marginally lower rate for boys does not (perhaps most children were lower class?) |
| Age | Only ages (0,5] had high survival, 68.8%; (10,15] had the second highest, 57.9%; all other bins fell between 28.6% and 46% (except the sole surviving (75,80] passenger) | The high survival of (0,5] aligns with the "women and children first" evac protocol; the similar rates amongst remaining bins suggest age alone was not a strong survival factor; created Age_Group to explore significance in the Composite Features x Target section |
| Fare/Fare_log | Log1p(Fare) values above 4.0 had the highest survival rates, ranging from 64% to 100% | Higher fares indicate socioeconomic class |
Composite Feature x Target¶
def plot_survival_heatmap(df, feature_1, feature_2, target_col='Survived', cmap='Blues'):
"""
Plots a heatmap showing survival rate and sample size per combination of two categorical features.
Parameters:
df (pd.DataFrame): The dataset containing the features and target.
feature_1 (str): Feature for the y-axis.
feature_2 (str): Feature for the x-axis.
target_col (str): Binary target column, e.g. 'Survived'.
cmap (str): Seaborn colormap for the heatmap.
"""
# Compute survival rates and counts
grouped = df.groupby([feature_1, feature_2], observed=True)[target_col]
survival_rate = grouped.mean().unstack()
sample_size = grouped.count().unstack()
# Create string annotations like "0.73\n(n=88)"
annotations = survival_rate.copy().astype("object") # Prevent dtype warning
for i in survival_rate.index:
for j in survival_rate.columns:
rate = survival_rate.loc[i, j]
count = sample_size.loc[i, j]
if pd.notna(rate) and pd.notna(count):
annotations.loc[i, j] = f"{rate:.2f}\n(n={int(count)})"
else:
annotations.loc[i, j] = ""
# Plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(survival_rate, annot=annotations, fmt="", cmap=cmap, cbar=True, linewidths=0.5, linecolor='gray')
plt.title(f"Survival Rate by {feature_1} × {feature_2}")
plt.xlabel(feature_2)
plt.ylabel(feature_1)
plt.tight_layout()
plt.show()
Pclass x Sex¶
- Visually distinct differences in survival rates of different combinations of gender and ticket class
- Consistent with priority given to social class and "women (and children) first" evac policy
- Next Step: Chi-squared test survival association with Pclass x Sex; Explore crossing additional features given high sample sizes.
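The chi-squared association tests planned throughout these "Next Step" notes can be sketched with scipy.stats.chi2_contingency; the toy frame below stands in for prepared_train_df:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Sketch of the planned chi-squared test: build a contingency table of a
# composite feature vs. Survived and test for independence. In the
# notebook this would run on prepared_train_df['Pclass_Sex'] and
# prepared_train_df['Survived'].
df = pd.DataFrame({
    "Pclass_Sex": ["1_female", "1_female", "3_male", "3_male", "3_male"],
    "Survived":   [1, 1, 0, 0, 1],
})
table = pd.crosstab(df["Pclass_Sex"], df["Survived"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

A small p-value would indicate the feature and Survived are unlikely to be independent; on real data, check expected cell counts (>= 5) before trusting the test.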
def create_feature_Pclass_Sex(train_df, test_df):
"""
Creates a composite feature 'Pclass_Sex' by combining:
- Pclass (1, 2, 3)
- Sex ('male' or 'female')
Example values: '1_male', '3_female'
Args:
train_df (pd.DataFrame): Training set.
test_df (pd.DataFrame): Test set.
Returns:
None (adds 'Pclass_Sex' column to both dataframes)
"""
for df in [train_df, test_df]:
# Vectorized string concatenation; equivalent to (but faster than) a row-wise apply
df['Pclass_Sex'] = df['Pclass'].astype(str) + '_' + df['Sex'].astype(str)
print("Created 'Pclass_Sex' in train_df and test_df.")
create_feature_Pclass_Sex(prepared_train_df, prepared_test_df)
Created 'Pclass_Sex' in train_df and test_df.
plot_survival_heatmap(prepared_train_df, 'Pclass', 'Sex', cmap=custom_cmap)
Pclass x Title¶
- P1 and P2 young boys (Master.) had a 100% survival rate (beware of the n=12 sample size)
- This interaction shows significantly heightened survival for young males compared to crossing Pclass x Sex
- Survival rates amongst female titles (Mrs., Miss.) look consistent with the Pclass x Sex signal
- Next Step: Chi-squared test association with Survival; investigate creating a Master-specific breakout.
plot_survival_heatmap(prepared_train_df, 'Pclass', 'Title', cmap=custom_cmap)
Pclass x Parch¶
- Survival rate increased for P1 and P2 passengers as Parch increased (ignoring groups with n < 10)
- Survival rate of P3 passengers increased from 0->1 and then decreased from 1->2
- Next Step: Chi-squared test association between Pclass and Parch + SibSp.
plot_survival_heatmap(prepared_train_df, 'Pclass', 'Parch', cmap=custom_cmap)
Pclass x SibSp¶
- Survival rate increased for P1 passengers as SibSp increased (ignoring groups with n < 10)
- Survival rate of P3 passengers increased from 0->1 and then decreased from 1->2
- Next Step: Chi-squared test association between Pclass and Parch + SibSp.
plot_survival_heatmap(prepared_train_df, 'Pclass', 'SibSp', cmap=custom_cmap)
Sex x Parch¶
- Males with at least one parent/child had nearly double the survival rate of males traveling alone.
- Next Step: Chi-squared test association between Sex and Parch; consider an is_Male_Parch_0 feature.
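The proposed is_Male_Parch_0 flag is straightforward to sketch (a hypothetical construction of the feature named above, not yet implemented in the notebook):

```python
import pandas as pd

# Sketch of the proposed is_Male_Parch_0 flag: 1 for males traveling
# without any parents/children, who showed markedly lower survival.
df = pd.DataFrame({"Sex": ["male", "male", "female"],
                   "Parch": [0, 2, 0]})
df["is_Male_Parch_0"] = ((df["Sex"] == "male") & (df["Parch"] == 0)).astype(int)
print(df["is_Male_Parch_0"].tolist())  # [1, 0, 0]
```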
plot_survival_heatmap(prepared_train_df, 'Sex', 'Parch', cmap=custom_cmap)
Sex x SibSp¶
- Similar observation as with Parch: males with at least one sibling or spouse had nearly double the survival rate of males traveling alone.
- Next Step: Chi-squared test survival association with Sex and SibSp; consider an is_Male_SibSp0 feature.
plot_survival_heatmap(prepared_train_df, 'Sex', 'SibSp', cmap=custom_cmap)
Pclass x Embarked¶
- 3rd class passengers from Southampton had distinctly lowest survival rate (p = 0.19)
- 3rd class survival of Cherbourg and Queenstown equal and lower than 1st and 2nd classes (p = 0.38)
- Very few 1st and 2nd class passengers from Queenstown (n < 10)
- Cherbourg survival of 1st and 2nd class greater than Southampton
- Next Step: Chi-squared test survival association with Pclass x Embarked.
plot_survival_heatmap(prepared_train_df, 'Pclass', 'Embarked', cmap=custom_cmap)
Sex x Embarked¶
- Males from Queenstown had a distinctly low survival rate (p = 0.07)
- Well supported across all groups (n > 30)
- Next Step: Chi-squared test survival association with Sex x Embarked.
plot_survival_heatmap(prepared_train_df, 'Sex', 'Embarked', cmap=custom_cmap)
Pclass x HasCabin¶
- 3rd class passengers without a cabin designation had a distinctly low survival rate (p = 0.24)
- 2nd class passengers with a cabin designation had a distinctly high survival rate (p = 0.81), mid sized sample
- Next Step: Chi-squared test survival association with Pclass x HasCabin.
plot_survival_heatmap(prepared_train_df, 'Pclass', 'HasCabin', cmap=custom_cmap)
Sex x HasCabin¶
- Females with a cabin designation had distinctly high survival rate (p = 0.94)
- Males without a cabin designation had a distinctly low survival rate (p = 0.14)
- Next Step: Chi-squared test survival association with Sex x HasCabin.
plot_survival_heatmap(prepared_train_df, 'Sex', 'HasCabin', cmap=custom_cmap)
Parch x HasCabin¶
- Those having between 0 and 2 parents/children and a cabin designation have distinctly higher survival rates.
- Next Step: Chi-squared test survival association with Parch x HasCabin.
plot_survival_heatmap(prepared_train_df, 'Parch', 'HasCabin', cmap=custom_cmap)
SibSp x HasCabin¶
- Similar story as with Parch: those having between 0 and 2 siblings/spouses and a cabin designation have distinctly higher survival rates.
- Next Step: Chi-squared test survival association with SibSp x HasCabin. Investigate (Parch + SibSp) x HasCabin.
plot_survival_heatmap(prepared_train_df, 'SibSp', 'HasCabin', cmap=custom_cmap)
Embarked x HasCabin¶
- Passengers from Queenstown with a cabin designation have a distinctly higher survival (p = 0.75).
- Passengers from Southampton without a cabin designation have a distinctly lower survival (p = 0.27).
- Next Step: Chi-squared test survival association with Embarked x HasCabin.
plot_survival_heatmap(prepared_train_df, 'Embarked', 'HasCabin', cmap=custom_cmap)
Pclass x Cabin_count¶
- Looks redundant to Pclass x HasCabin
- Low sample sizes for cabin counts >= 2 (n < 10)
- Next Step: Skip engineering for now.
plot_survival_heatmap(prepared_train_df, 'Pclass', 'Cabin_count', cmap=custom_cmap)
Sex x Cabin_count¶
- Looks redundant to Sex x HasCabin
- Low sample sizes for Cabin_count >= 2 (n < 10)
- Next Step: Skip engineering for now.
plot_survival_heatmap(prepared_train_df, 'Sex', 'Cabin_count', cmap=custom_cmap)
Pclass x Cabin_Location_s¶
- P1 starboard passengers had higher survival than P1 port passengers (p = 0.74 vs p = 0.60).
- Sample sizes low for all scenarios
- No_cabin_info scenario redundant to Pclass x HasCabin
- Next Step: Chi-squared test survival association with Pclass x Cabin_Location_s
plot_survival_heatmap(prepared_train_df, 'Pclass', 'Cabin_Location_s', cmap=custom_cmap)
Sex x Cabin_Location_s¶
- Survival rate does not differ significantly across cabin locations.
- Next Step: Skip engineering feature for now.
plot_survival_heatmap(prepared_train_df, 'Sex', 'Cabin_Location_s', cmap=custom_cmap)
Pclass x Deck_bin¶
- P1 Survival rates similar across (B, D, and E) and (A & M)
- All other sample sizes across non-M deck P2/P3 classes are too small (n < 10)
- Next Step: Chi-squared test survival association with Pclass x Deck_bin feature.
plot_survival_heatmap(prepared_train_df, 'Pclass', 'Deck', cmap=custom_cmap)
def create_feature_Deck_bin(train_df, test_df):
"""
Creates a binned Deck feature grouping decks with similar survival behavior:
- AM: Decks A, M
- BDE: Decks B, D, E
- C: Deck C
- Rare: Decks F, G, T
"""
access_map = {
'A': 'AM',
'M': 'AM',
'B': 'BDE',
'C': 'C',
'D': 'BDE',
'E': 'BDE',
'F': 'Rare',
'G': 'Rare',
'T': 'Rare'
}
for df in [train_df, test_df]:
df['Deck_bin'] = df['Deck'].map(access_map).astype('category').astype(str)
create_feature_Deck_bin(prepared_train_df, prepared_test_df)
plot_survival_heatmap(prepared_train_df, 'Pclass', 'Deck_bin', cmap=custom_cmap)
Sex x Deck_bin¶
- Survival rates similar across non-M decks
- M-deck survival rates redundant to Sex x HasCabin
- Sex x Deck_bin also appears redundant to Sex x HasCabin
- Next Step: Skip engineering feature for now.
plot_survival_heatmap(prepared_train_df, 'Sex', 'Deck', cmap=custom_cmap)
plot_survival_heatmap(prepared_train_df, 'Sex', 'Deck_bin', cmap=custom_cmap)
Parch x Deck_bin¶
- Those alone and with families residing on decks BDE had distinctly higher survival rates.
- Solo travelers also had heightened survival rates on Deck C.
- Next Step: Chi-squared test survival association with Parch x Deck_bin
plot_survival_heatmap(prepared_train_df, 'Parch', 'Deck_bin', cmap=custom_cmap)
SibSp x Deck_bin¶
- Similar story to Parch: survival rates for solo and family travelers were significantly higher on Decks BDE and C.
- Next Step: Chi-squared test survival association with SibSp x Deck_bin
plot_survival_heatmap(prepared_train_df, 'SibSp', 'Deck_bin', cmap=custom_cmap)
Deck x Cabin_Location_s¶
- Sample sizes across classes not large enough.
- Next Step: Skip engineering feature for now.
plot_survival_heatmap(prepared_train_df, 'Deck', 'Cabin_Location_s', cmap=custom_cmap)
Pclass x Title_bin¶
- Redundant to Pclass x Sex, with smaller sample sizes
- Provides distinct survival rates between Master and Mr titles, supported by n > 20.
- Next Step: Chi-squared test survival association with Pclass x Title_bin
def create_feature_Title_bin(train_df, test_df):
def bin_title(title):
if title in ['Mr.', 'Don.', 'Sir.', 'Jonkheer.']:
return 'Mr'
elif title == 'Master.':
return 'Master'
elif title in ['Mrs.', 'Mme.', 'Lady.', 'Countess.']:
return 'Mrs'
elif title in ['Miss.', 'Ms.', 'Mlle.']:
return 'Miss'
else:
return 'Other'
for df in [train_df, test_df]:
df['Title_bin'] = df['Title'].apply(bin_title).astype('category').astype(str)
create_feature_Title_bin(prepared_train_df, prepared_test_df)
plot_survival_heatmap(prepared_train_df, 'Pclass', 'Title_bin', cmap=custom_cmap)
Sex x Title_bin¶
- Provides distinct survival rates between Master/Mr and Miss/Mrs
- Next Step: Chi-squared test survival association with Sex x Title_bin.
plot_survival_heatmap(prepared_train_df, 'Sex', 'Title_bin', cmap=custom_cmap)
Pclass x Age_Group¶
- Provides distinct survival rates across classes and groups.
- Next Step: Chi-squared test survival association with Pclass x Age_Group.
plot_survival_heatmap(prepared_train_df, 'Pclass', 'Age_Group', cmap=custom_cmap)
Sex x Age_Group¶
- Provides distinct survival rates across classes and groups.
- Next Step: Chi-squared test survival association with Sex x Age_Group.
plot_survival_heatmap(prepared_train_df, 'Sex', 'Age_Group', cmap=custom_cmap)
Pclass x FPP_log_bin¶
- Relatively small variance in survival across fare-per-person bins within each ticket class
- P2 passenger survival increased 6% from Fare Bin 4->5
- Next Step: Chi-squared test survival association.
def create_feature_Fare_per_person_log(train_df, test_df, target_col='Survived', n_splits=5, random_state=42):
"""
Creates Fare_Per_Person, Fare_Per_Person_log, and Fare_Per_Person_log_bin features.
This version prevents data leakage:
- For train_df: Fare per person is computed out-of-fold (ticket counts from K-1 folds only).
- For test_df: Fare per person is computed using ticket counts from full training data.
- All log features are cast to float32 to maintain dtype consistency.
Args:
train_df (DataFrame): Training dataset containing 'Fare' and 'Ticket' columns.
test_df (DataFrame): Test dataset containing 'Fare' and 'Ticket' columns.
target_col (str): Column used for stratification (default: 'Survived').
n_splits (int): Number of Stratified K-Folds.
random_state (int): Random seed for reproducibility.
Returns:
None. Modifies train_df and test_df in-place.
"""
oof_fare_per_person_log = pd.Series(index=train_df.index, dtype=np.float32)
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
for train_idx, val_idx in skf.split(train_df, train_df[target_col]):
fold_train = train_df.iloc[train_idx]
fold_val = train_df.iloc[val_idx]
fold_ticket_counts = fold_train['Ticket'].value_counts()
# Use the positional fold slice (fold_val) rather than train_df.loc[val_idx],
# which is only correct when train_df has a default RangeIndex.
val_ticket_counts = fold_val['Ticket'].map(fold_ticket_counts).fillna(1)
fare_per_person = fold_val['Fare'] / val_ticket_counts
oof_fare_per_person_log.iloc[val_idx] = np.log1p(fare_per_person).astype(np.float32).to_numpy()
train_df['Fare_Per_Person_log'] = oof_fare_per_person_log
train_df['Fare_Per_Person'] = np.expm1(train_df['Fare_Per_Person_log']).astype(np.float32)
# Test set: use full train ticket counts
full_ticket_counts = train_df['Ticket'].value_counts()
test_ticket_counts = test_df['Ticket'].map(full_ticket_counts).fillna(1)
test_df['Fare_Per_Person'] = (test_df['Fare'] / test_ticket_counts).astype(np.float32)
test_df['Fare_Per_Person_log'] = np.log1p(test_df['Fare_Per_Person']).astype(np.float32)
# Binning based on training data distribution
bins = np.quantile(train_df['Fare_Per_Person_log'], q=np.linspace(0, 1, 6))
bins[0] = -np.inf
bins[-1] = np.inf
train_df['Fare_Per_Person_log_bin'] = pd.cut(train_df['Fare_Per_Person_log'], bins=bins, labels=False)
test_df['Fare_Per_Person_log_bin'] = pd.cut(test_df['Fare_Per_Person_log'], bins=bins, labels=False)
print("✅ Fare_Per_Person_log and bins added to train and test sets (leakage prevented, dtype safe).")
create_feature_Fare_per_person_log(prepared_train_df, prepared_test_df)
✅ Fare_Per_Person_log and bins added to train and test sets (leakage prevented, dtype safe).
def create_feature_Fare_per_person_log_bin(train_df, test_df):
"""
Creates "Fare_Per_Person", "Fare_Per_Person_log", and "Fare_Per_Person_log_bin" features.
Applies quantile-based binning to "Fare_Per_Person_log" using training data to minimize distribution shift.
Bin edges are expanded to include ±infinity to ensure all test values map to valid bins.
Args:
train_df (DataFrame): Training data set
test_df (DataFrame): Test data set
Returns:
Nothing
"""
# Calculate ticket counts from training data
training_ticket_counts = train_df['Ticket'].value_counts()
for df in [train_df, test_df]:
ticket_counts = df['Ticket'].map(training_ticket_counts).fillna(1)
df['Fare_Per_Person'] = df['Fare'] / ticket_counts
df['Fare_Per_Person_log'] = np.log1p(df['Fare_Per_Person'])
# Compute quantile-based bins from training set
qcut_bins = pd.qcut(train_df['Fare_Per_Person_log'], q=6, duplicates='drop', retbins=True)[1]
qcut_bins[0] = -np.inf # Extend first bin edge to -inf
qcut_bins[-1] = np.inf # Extend last bin edge to +inf
# Create bin labels
bin_labels = [f"Bin {i+1}" for i in range(len(qcut_bins) - 1)]
cat_dtype = pd.api.types.CategoricalDtype(categories=bin_labels, ordered=True)
for df in [train_df, test_df]:
df['FPP_log_bin'] = pd.cut(
df['Fare_Per_Person_log'],
bins=qcut_bins,
labels=bin_labels,
include_lowest=True
).astype(cat_dtype)
create_feature_Fare_per_person_log_bin(prepared_train_df, prepared_test_df)
plot_survival_heatmap(prepared_train_df, 'Pclass', 'FPP_log_bin', cmap=custom_cmap)
Sex x FPP_log_bin¶
- Reveals lower survival probabilities for females across FPP Bins 1 through 4.
- Next Step: Chi-squared test survival association with Sex x FPP_log_bin.
plot_survival_heatmap(prepared_train_df, 'Sex', 'FPP_log_bin', cmap=custom_cmap)
Pclass x Parch_SibSp¶
- Distinct survival rates across ticket classes and Parch_SibSp 0->2
- Next Steps: Chi-squared test survival association with Pclass x Parch_SibSp; Bin Parch_SibSp >= 4
def create_feature_Parch_SibSp(train_df, test_df):
for df in [train_df, test_df]:
df['Parch_SibSp'] = df['Parch'] + df['SibSp']
create_feature_Parch_SibSp(prepared_train_df, prepared_test_df)
plot_survival_heatmap(prepared_train_df, 'Pclass', 'Parch_SibSp', cmap=custom_cmap)
Sex x Parch_SibSp¶
- Distinct survival rates across sexes and Parch_SibSp 0->3
- Next Step: Chi-squared test survival association with Sex x Parch_SibSp
plot_survival_heatmap(prepared_train_df, 'Sex', 'Parch_SibSp', cmap=custom_cmap)
HasCabin x Parch_SibSp¶
- Survival rate increased for passengers without cabin designations from Parch_SibSp 0->3
- Survival rate increased more sharply from 0->1, then stayed similar
- Next Step: Bin Parch_SibSp 2+ or 3+ and chi-squared test survival association
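The proposed binning can be sketched by capping Parch_SibSp so the sparse large-family groups pool into a single "3+" category (the 3+ cutoff is one of the two options mentioned above):

```python
import pandas as pd

# Sketch of the proposed binning: cap Parch_SibSp at 3 so the sparse
# large-family groups (n < 20) merge into one "3+" category.
df = pd.DataFrame({"Parch_SibSp": [0, 1, 2, 3, 5, 7]})
df["Parch_SibSp_bin"] = df["Parch_SibSp"].clip(upper=3).astype(str)
df.loc[df["Parch_SibSp"] >= 3, "Parch_SibSp_bin"] = "3+"
print(df["Parch_SibSp_bin"].tolist())  # ['0', '1', '2', '3+', '3+', '3+']
```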
plot_survival_heatmap(prepared_train_df, 'HasCabin', 'Parch_SibSp', cmap=custom_cmap)
Hi-Cardinality Features¶
Ticket¶
- We can assume that passengers sharing the same Ticket are part of the same traveling group.
- Ideas for Feature Engineering:
- Create a "ticket frequency" feature that counts the number of training-set passengers sharing the same Ticket value, and map that count onto test-set passengers holding the same ticket.
ticket_counts = prepared_train_df['Ticket'].value_counts()
shared_counts = (
ticket_counts
.value_counts()
.rename_axis('Ticket_Frequency')
.reset_index(name='Passenger_Count')
.sort_values('Ticket_Frequency')
)
print(shared_counts)
   Ticket_Frequency  Passenger_Count
0                 1              547
1                 2               94
2                 3               21
3                 4               11
6                 5                2
5                 6                3
4                 7                3
Feature Priority Based on EDA¶
- The following list is ordered by most potentially valuable insights / predictive power first.
- Given the long list and limited timeframe, a subset of identified features will be prioritized and iterated upon during Model Development phase.
| Feature | Domain Insights | Stats Notes |
|---|---|---|
| Pclass x Sex | Females: 1st/2nd class: +90% survival, 3rd: 50%; Males: Decreasing survival from 1st 37% to 3rd 14% | Clear survival differences between groups; high sample count support for further grouping |
| Pclass x Age_Group | Distinct survival rates across class and ages | Sample sizes borderline low (n ~ 10) |
| Pclass x HasCabin | Having a cabin designation increased survival across all classes (Range: 0.50 (P3), 0.81 (P2)) | Low sample sizes for P2/P3 HasCabin=True (n < 16) |
| Sex x HasCabin | Having a cabin designation boosted survival ~30% for both sexes | Well supported across all 4 groups |
| Embarked x HasCabin | Having cabin designation boosted survival 10%-30% across the embarkation points | Low samples for Queenstown/HasCabin=True |
| HasCabin x Parch_SibSp | Survival rate increased at differing rates between have/have-not for 0->3 | Low samples for 3+ (n < 20) |
| Pclass x Parch_SibSp | Showed survival rate increased across classes for family sizes 0->3 | Sample sizes n < 20 for Parch_SibSp 2+ |
| Sex x Parch_SibSp | Showed survival rate increased across sexes for family sizes 0->3 | n < 20 for Parch_SibSp 3+ |
| Pclass x Embarked | Southampton 3rd class perished most (0.19, n=353), Cherbourg 1st class perished least (0.69, n=85) | Smooth feature to account for sparse categories |
| Sex x Embarked | Queenstown males had lowest survival (0.07, n=41); Cherbourg females had highest (0.88, n=73) | Well supported (n > 30) across groups |
| Pclass x Deck_bin | P1 Survival rates similar for Decks BDE and AM; BDE Decks had highest survival (~75%) | Low samples for non-M non-P1 groups (n < 10) |
| Pclass x Cabin_Location_s | P1 passengers on starboard side had 10% higher survival than port | Low samples (n < 10) for other port/starboard classes |
| Pclass x Title | Shows 1st/2nd Male young boys had 100% survival | Sample size only 12, start with smoothed feature but may need to replace with boolean |
| Sex x FPP_log_bin | Female survival increased across FPP bins 1 through 4 (42%->74%) | Groups well supported |
Cross-Fold Distribution Shift Analysis¶
- For each existing variable and each feature to be engineered, the average Kullback-Leibler (KL) divergence is calculated between training and validation folds of the training set to quantify distribution shift and assess generalizability to unseen data.
- KL divergence calculated from training set data only to mitigate data leakage.
- Average KL divergence is calculated from the KL divergences between K-1 training folds and 1 "unseen" validation fold for 5 cross-validation iterations.
- Threshold for significant Cross-Fold (CF) Distribution Shift is KL >= 0.02
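As a quick sanity check on the metric itself: KL divergence of a distribution with itself is zero, and KL is asymmetric (KL(p||q) != KL(q||p) in general). A minimal illustration using the same smoothed helper defined in the next cell (repeated here so this snippet is self-contained):

```python
import numpy as np
from scipy.stats import entropy

def kl_divergence(p, q, smooth=1e-6):
    """Smoothed KL divergence; scipy's entropy(p, q) normalizes and computes KL(p || q)."""
    p = np.asarray(p) + smooth
    q = np.asarray(q) + smooth
    return entropy(p, q)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
# kl_divergence(p, p) is ~0; kl_divergence(p, q) and kl_divergence(q, p) differ
```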
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from scipy.stats import entropy
def kl_divergence(p, q, smooth=1e-6):
"""KL divergence using scipy.stats.entropy with smoothing."""
p = np.asarray(p) + smooth
q = np.asarray(q) + smooth
return entropy(p, q)
def evaluate_feature_kl_divergence(
df,
feature_list,
target_col='Survived',
n_splits=5,
random_state=42,
auto_bin_strategy='quantile', # or 'uniform'
n_bins=10
):
"""
Evaluate KL divergence between train and val distributions for categorical or continuous features.
Args:
df (pd.DataFrame): Input DataFrame.
feature_list (list): List of feature names (str or list/tuple of two features).
target_col (str): Target column for stratification.
n_splits (int): Number of cross-validation folds.
random_state (int): Random seed.
auto_bin_strategy (str): 'quantile' or 'uniform' binning for continuous variables.
n_bins (int): Number of bins if binning is applied.
Returns:
styled (pd.io.formats.style.Styler): Highlighted summary table.
df_results (pd.DataFrame): Full results table.
"""
results = []
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
def compute_bin_edges(series):
if auto_bin_strategy == 'quantile':
return np.unique(np.quantile(series.dropna(), np.linspace(0, 1, n_bins + 1)))
elif auto_bin_strategy == 'uniform':
return np.linspace(series.min(), series.max(), n_bins + 1)
else:
raise ValueError("Invalid auto_bin_strategy: choose 'quantile' or 'uniform'")
def bin_if_continuous(series, bin_edges):
return pd.cut(series, bins=bin_edges, include_lowest=True)
for feature in feature_list:
kl_values = []
for train_idx, val_idx in skf.split(df, df[target_col]):
fold_train = df.iloc[train_idx].copy()
fold_val = df.iloc[val_idx].copy()
if isinstance(feature, (list, tuple)):
feat_name = " x ".join(feature)
for col in feature:
if pd.api.types.is_numeric_dtype(fold_train[col]) and fold_train[col].nunique() > 25:
print(f"Binning {col}: {fold_train[col].nunique()} unique values")
edges = compute_bin_edges(fold_train[col])
fold_train[col] = bin_if_continuous(fold_train[col], edges)
fold_val[col] = bin_if_continuous(fold_val[col], edges)
train_counts = fold_train.groupby(list(feature), observed=True).size()
val_counts = fold_val.groupby(list(feature), observed=True).size()
else:
feat_name = feature
if pd.api.types.is_numeric_dtype(fold_train[feature]) and fold_train[feature].nunique() > 25:
edges = compute_bin_edges(fold_train[feature])
fold_train[feature] = bin_if_continuous(fold_train[feature], edges)
fold_val[feature] = bin_if_continuous(fold_val[feature], edges)
train_counts = fold_train[feature].value_counts()
val_counts = fold_val[feature].value_counts()
# Normalize to distributions
train_dist = train_counts / train_counts.sum()
val_dist = val_counts / val_counts.sum()
# Align keys
all_keys = train_dist.index.union(val_dist.index)
train_dist = train_dist.reindex(all_keys, fill_value=0)
val_dist = val_dist.reindex(all_keys, fill_value=0)
kl = kl_divergence(train_dist, val_dist)
kl_values.append(kl)
result = {
'Feature': feat_name,
'Avg_KL_Divergence': np.mean(kl_values),
'Min_KL_Divergence': np.min(kl_values),
'Max_KL_Divergence': np.max(kl_values),
'Std_KL_Divergence': np.std(kl_values)
}
for i, val in enumerate(kl_values):
result[f'Fold_{i+1}_KL'] = val
results.append(result)
df_results = pd.DataFrame(results).sort_values(by='Avg_KL_Divergence', ascending=False)
def highlight_kl(s):
return ['background-color: yellow' if v >= 0.02 else '' for v in s]
styled = df_results.style.apply(highlight_kl, subset=['Avg_KL_Divergence'])
return styled, df_results
features_to_evaluate = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked', 'HasCabin', 'Cabin_count', 'Cabin_Location_s',
'Deck', 'Title', 'Age', 'Age_Group', 'Fare', 'FPP_log_bin', 'Parch_SibSp', ['Pclass', 'Sex'], ['Pclass', 'Title'], ['Pclass', 'Parch'],
['Pclass', 'SibSp'], ['Sex', 'Parch'], ['Sex', 'SibSp'], ['Pclass', 'Embarked'], ['Sex', 'Embarked'],
['Pclass', 'HasCabin'], ['SibSp', 'HasCabin'], ['Parch', 'HasCabin'],
['Embarked', 'HasCabin'], ['Pclass', 'Cabin_count'], ['Sex', 'Cabin_count'], ['Pclass', 'Cabin_Location_s'],
['Sex', 'Cabin_Location_s'], ['Pclass', 'Deck_bin'], ['Sex', 'Deck_bin'], ['Parch', 'Deck_bin'], ['SibSp', 'Deck_bin'],
['Deck', 'Cabin_Location_s'], ['Pclass', 'Title_bin'], ['Sex', 'Title_bin'], ['Pclass', 'Age_Group'], ['Sex', 'Age_Group'],
['Pclass', 'FPP_log_bin'], ['Sex', 'FPP_log_bin'], ['Pclass', 'Parch_SibSp'], ['Sex', 'Parch_SibSp']]
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, features_to_evaluate)
display(styled)
| Feature | Avg_KL_Divergence | Min_KL_Divergence | Max_KL_Divergence | Std_KL_Divergence | Fold_1_KL | Fold_2_KL | Fold_3_KL | Fold_4_KL | Fold_5_KL | |
|---|---|---|---|---|---|---|---|---|---|---|
| 16 | Pclass x Title | 0.233912 | 0.151928 | 0.319446 | 0.053121 | 0.151928 | 0.226165 | 0.237274 | 0.319446 | 0.234747 |
| 34 | Parch x Deck_bin | 0.189282 | 0.087900 | 0.319458 | 0.092661 | 0.138525 | 0.280541 | 0.119985 | 0.319458 | 0.087900 |
| 44 | Sex x Parch_SibSp | 0.181378 | 0.048391 | 0.339622 | 0.110240 | 0.048391 | 0.259522 | 0.187058 | 0.339622 | 0.072297 |
| 9 | Title | 0.176547 | 0.093651 | 0.285688 | 0.064112 | 0.093651 | 0.191216 | 0.137166 | 0.285688 | 0.175013 |
| 43 | Pclass x Parch_SibSp | 0.165862 | 0.070471 | 0.253850 | 0.076717 | 0.077618 | 0.224291 | 0.070471 | 0.253850 | 0.203078 |
| 37 | Pclass x Title_bin | 0.156349 | 0.064196 | 0.266336 | 0.066432 | 0.064196 | 0.122950 | 0.175081 | 0.266336 | 0.153180 |
| 36 | Deck x Cabin_Location_s | 0.138029 | 0.088532 | 0.184896 | 0.035618 | 0.088532 | 0.171475 | 0.129813 | 0.184896 | 0.115430 |
| 30 | Pclass x Cabin_Location_s | 0.124354 | 0.060462 | 0.187865 | 0.042098 | 0.060462 | 0.136060 | 0.135184 | 0.102198 | 0.187865 |
| 20 | Sex x SibSp | 0.120758 | 0.020390 | 0.188893 | 0.067905 | 0.063143 | 0.188893 | 0.143508 | 0.187856 | 0.020390 |
| 18 | Pclass x SibSp | 0.113705 | 0.071795 | 0.211615 | 0.050058 | 0.086212 | 0.095968 | 0.102938 | 0.211615 | 0.071795 |
| 17 | Pclass x Parch | 0.108081 | 0.058864 | 0.159577 | 0.033182 | 0.119175 | 0.159577 | 0.111851 | 0.058864 | 0.090935 |
| 41 | Pclass x FPP_log_bin | 0.103770 | 0.006048 | 0.200233 | 0.065205 | 0.076166 | 0.093450 | 0.200233 | 0.142951 | 0.006048 |
| 35 | SibSp x Deck_bin | 0.103751 | 0.063224 | 0.205703 | 0.051705 | 0.079703 | 0.090234 | 0.079892 | 0.205703 | 0.063224 |
| 19 | Sex x Parch | 0.101998 | 0.071684 | 0.123776 | 0.022449 | 0.115695 | 0.123776 | 0.120930 | 0.071684 | 0.077904 |
| 32 | Pclass x Deck_bin | 0.097972 | 0.021305 | 0.156873 | 0.058168 | 0.036229 | 0.021305 | 0.156873 | 0.120449 | 0.155007 |
| 39 | Pclass x Age_Group | 0.092701 | 0.029594 | 0.189176 | 0.057705 | 0.120718 | 0.189176 | 0.080586 | 0.043430 | 0.029594 |
| 25 | Parch x HasCabin | 0.086801 | 0.033732 | 0.127121 | 0.036505 | 0.122509 | 0.127121 | 0.093794 | 0.033732 | 0.056850 |
| 28 | Pclass x Cabin_count | 0.080167 | 0.032598 | 0.210130 | 0.065470 | 0.050717 | 0.032598 | 0.055862 | 0.051528 | 0.210130 |
| 3 | Parch | 0.070782 | 0.014046 | 0.120372 | 0.035046 | 0.120372 | 0.083354 | 0.080111 | 0.014046 | 0.056025 |
| 26 | SibSp x HasCabin | 0.069913 | 0.025272 | 0.102681 | 0.026465 | 0.074920 | 0.025272 | 0.087253 | 0.102681 | 0.059437 |
| 29 | Sex x Cabin_count | 0.066488 | 0.028877 | 0.110650 | 0.027259 | 0.070893 | 0.110650 | 0.049367 | 0.028877 | 0.072651 |
| 38 | Sex x Title_bin | 0.065952 | 0.018694 | 0.238731 | 0.086440 | 0.022777 | 0.027864 | 0.018694 | 0.238731 | 0.021693 |
| 14 | Parch_SibSp | 0.062126 | 0.010545 | 0.171704 | 0.061770 | 0.011865 | 0.171704 | 0.010545 | 0.088703 | 0.027815 |
| 8 | Deck | 0.050886 | 0.023459 | 0.077066 | 0.020708 | 0.023459 | 0.056705 | 0.030404 | 0.077066 | 0.066794 |
| 21 | Pclass x Embarked | 0.050665 | 0.017766 | 0.071438 | 0.020350 | 0.036846 | 0.059818 | 0.071438 | 0.067456 | 0.017766 |
| 31 | Sex x Cabin_Location_s | 0.049701 | 0.026488 | 0.063176 | 0.012454 | 0.048990 | 0.063176 | 0.055036 | 0.026488 | 0.054816 |
| 23 | Pclass x HasCabin | 0.045572 | 0.005157 | 0.151005 | 0.054150 | 0.030186 | 0.005157 | 0.035636 | 0.005874 | 0.151005 |
| 33 | Sex x Deck_bin | 0.038423 | 0.010414 | 0.128073 | 0.044959 | 0.018159 | 0.010414 | 0.014698 | 0.128073 | 0.020769 |
| 42 | Sex x FPP_log_bin | 0.037669 | 0.007717 | 0.077980 | 0.024528 | 0.077980 | 0.051409 | 0.027482 | 0.023759 | 0.007717 |
| 40 | Sex x Age_Group | 0.037536 | 0.015729 | 0.066647 | 0.023436 | 0.018944 | 0.066647 | 0.065694 | 0.015729 | 0.020665 |
| 12 | Fare | 0.035025 | 0.018682 | 0.067847 | 0.017375 | 0.067847 | 0.036000 | 0.023872 | 0.018682 | 0.028724 |
| 7 | Cabin_Location_s | 0.035010 | 0.019988 | 0.062994 | 0.018240 | 0.020678 | 0.062994 | 0.020721 | 0.019988 | 0.050667 |
| 6 | Cabin_count | 0.029722 | 0.005008 | 0.073406 | 0.023047 | 0.023223 | 0.026385 | 0.020590 | 0.005008 | 0.073406 |
| 27 | Embarked x HasCabin | 0.028635 | 0.010627 | 0.050821 | 0.016619 | 0.014879 | 0.050821 | 0.046258 | 0.010627 | 0.020591 |
| 10 | Age | 0.027167 | 0.006812 | 0.036574 | 0.010653 | 0.036574 | 0.027190 | 0.034200 | 0.006812 | 0.031057 |
| 2 | SibSp | 0.023899 | 0.003725 | 0.070919 | 0.024648 | 0.004250 | 0.018335 | 0.022264 | 0.070919 | 0.003725 |
| 13 | FPP_log_bin | 0.020249 | 0.001980 | 0.046791 | 0.016337 | 0.046791 | 0.029615 | 0.016580 | 0.006279 | 0.001980 |
| 22 | Sex x Embarked | 0.018854 | 0.011960 | 0.029482 | 0.006015 | 0.017022 | 0.011960 | 0.029482 | 0.020665 | 0.015141 |
| 11 | Age_Group | 0.010305 | 0.003488 | 0.015674 | 0.004456 | 0.014158 | 0.015674 | 0.010868 | 0.003488 | 0.007339 |
| 15 | Pclass x Sex | 0.009868 | 0.004667 | 0.021669 | 0.006045 | 0.004667 | 0.008709 | 0.021669 | 0.007522 | 0.006772 |
| 4 | Embarked | 0.007985 | 0.003290 | 0.013197 | 0.003653 | 0.013197 | 0.004446 | 0.003290 | 0.009964 | 0.009027 |
| 0 | Pclass | 0.004215 | 0.000181 | 0.011979 | 0.004194 | 0.004356 | 0.003669 | 0.011979 | 0.000181 | 0.000889 |
| 1 | Sex | 0.003707 | 0.000391 | 0.008590 | 0.002726 | 0.000391 | 0.002468 | 0.008590 | 0.002939 | 0.004146 |
| 5 | HasCabin | 0.000622 | 0.000080 | 0.002048 | 0.000730 | 0.000552 | 0.000214 | 0.000214 | 0.000080 | 0.002048 |
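Given the `df_results` frame returned by `evaluate_feature_kl_divergence`, the features breaching the 0.02 threshold can be pulled out directly for the remediation work below. A sketch (`shifted_features` and the toy results frame are illustrative):

```python
import pandas as pd

KL_THRESHOLD = 0.02  # same cutoff used for highlighting above

def shifted_features(df_results, threshold=KL_THRESHOLD):
    """Return feature names whose average cross-fold KL meets or exceeds the threshold."""
    mask = df_results['Avg_KL_Divergence'] >= threshold
    return df_results.loc[mask, 'Feature'].tolist()

# Toy results frame mirroring the columns produced above
toy_results = pd.DataFrame({
    'Feature': ['Pclass x Title', 'Sex', 'Fare'],
    'Avg_KL_Divergence': [0.2339, 0.0037, 0.0350],
})
```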
Feature Engineering¶
Reduce Distribution Shift of Select Features¶
Pclass x Age_Group¶
- Given the large differences between Age distributions per ticket class, I'm forgoing attempting to create a feature that tries to identify similarities across them.
- See Pclass x Sex x Age_Group for Age_Group-based feature.
sns.kdeplot(data=prepared_train_df, x="Age", hue="Pclass")
plt.title("Large Age Distribution Differences between Ticket Classes")
plt.show()
Pclass_HasCabin¶
Creating (n < 30)-binned Pclass_HasCabin reduced KL to acceptable levels (KL = 0.015919).
def create_feature_Pclass_HasCabin(train_df, test_df):
"""
Creates a composite feature 'Pclass_HasCabin' by combining:
- Pclass (1, 2, or 3)
- HasCabin (converted to 0/1)
Any combination with fewer than 30 samples in the training set is binned into 'Rare'.
Args:
train_df (pd.DataFrame): Training set.
test_df (pd.DataFrame): Test set.
Returns:
None (adds 'Pclass_HasCabin' column to both dataframes)
"""
def combine(pclass, has_cabin):
return f"{pclass}_{int(has_cabin)}"
# Generate raw composite keys
train_keys = train_df.apply(lambda row: combine(row['Pclass'], row['HasCabin']), axis=1)
# Count frequencies and find common groups
value_counts = train_keys.value_counts()
common_groups = value_counts[value_counts >= 30].index
def assign_or_rare(pclass, has_cabin):
key = f"{pclass}_{int(has_cabin)}"
return key if key in common_groups else "Rare"
for df in [train_df, test_df]:
df['Pclass_HasCabin'] = df.apply(lambda row: assign_or_rare(row['Pclass'], row['HasCabin']), axis=1).astype(str)
print("Created 'Pclass_HasCabin' in train_df and test_df (groups < 30 binned to 'Rare').")
create_feature_Pclass_HasCabin(prepared_train_df, prepared_test_df)
Created 'Pclass_HasCabin' in train_df and test_df (groups < 30 binned to 'Rare').
prepared_train_df['Pclass_HasCabin'].value_counts()
Pclass_HasCabin
3_0     479
1_1     176
2_0     168
1_0      40
Rare     28
Name: count, dtype: int64
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['Pclass_HasCabin'])
display(styled)
| Feature | Avg_KL_Divergence | Min_KL_Divergence | Max_KL_Divergence | Std_KL_Divergence | Fold_1_KL | Fold_2_KL | Fold_3_KL | Fold_4_KL | Fold_5_KL | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pclass_HasCabin | 0.015919 | 0.005076 | 0.035123 | 0.013076 | 0.028302 | 0.005076 | 0.035123 | 0.005405 | 0.005687 |
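The composite-feature functions in this section all repeat the same pattern: concatenate columns into a key, then bin rare training-set keys into 'Rare'. That pattern could be factored into one generic helper; a sketch under that assumption (the helper name `create_rare_binned_composite` is hypothetical):

```python
import pandas as pd

def create_rare_binned_composite(train_df, test_df, cols, new_col, min_count=30):
    """Concatenate `cols` into a composite string key; keys with fewer than
    `min_count` training-set occurrences are binned into 'Rare'.
    The rare/common split is learned on the training set only."""
    def make_key(df):
        return df[cols].astype(str).agg('_'.join, axis=1)
    vc = make_key(train_df).value_counts()
    common = set(vc[vc >= min_count].index)
    for df in [train_df, test_df]:
        keys = make_key(df)
        df[new_col] = keys.where(keys.isin(common), 'Rare')

# Toy usage (not real Titanic data)
train = pd.DataFrame({'Pclass': [1] * 30 + [2] * 5, 'HasCabin': [1] * 30 + [0] * 5})
test = pd.DataFrame({'Pclass': [1, 2], 'HasCabin': [1, 0]})
create_rare_binned_composite(train, test, ['Pclass', 'HasCabin'], 'Pclass_HasCabin', min_count=30)
```

Any of the individual `create_feature_*` functions below could then be expressed as a one-line call with the appropriate columns and threshold.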
Sex x HasCabin¶
Created Sex_HasCabin - exhibited negligible CF distribution shift (KL = 0.005846).
def create_feature_Sex_HasCabin(train_df, test_df):
"""
Creates a composite feature 'Sex_HasCabin' by combining:
- Sex (e.g., 'male' or 'female')
- HasCabin (converted to 0 or 1)
Args:
train_df (pd.DataFrame): Training set.
test_df (pd.DataFrame): Test set.
Returns:
None (adds 'Sex_HasCabin' column to both dataframes)
"""
def combine(sex, has_cabin):
return f"{sex}_{int(has_cabin)}"
for df in [train_df, test_df]:
df['Sex_HasCabin'] = df.apply(lambda row: combine(row['Sex'], row['HasCabin']), axis=1).astype(str)
print("Created 'Sex_HasCabin' in train_df and test_df.")
create_feature_Sex_HasCabin(prepared_train_df, prepared_test_df)
Created 'Sex_HasCabin' in train_df and test_df.
prepared_train_df['Sex_HasCabin'].value_counts()
Sex_HasCabin
male_0      470
female_0    217
male_1      107
female_1     97
Name: count, dtype: int64
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['Sex_HasCabin'])
display(styled)
| Feature | Avg_KL_Divergence | Min_KL_Divergence | Max_KL_Divergence | Std_KL_Divergence | Fold_1_KL | Fold_2_KL | Fold_3_KL | Fold_4_KL | Fold_5_KL | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Sex_HasCabin | 0.005846 | 0.003257 | 0.009534 | 0.002385 | 0.003257 | 0.004473 | 0.009534 | 0.004221 | 0.007743 |
Embarked x HasCabin¶
Created Embarked_HasCabin - exhibited CF distribution shift (KL = 0.028635) due to the rare Q_1 combination. Keeping it, as rare groups will be excluded during smoothed feature translation.
def create_feature_Embarked_HasCabin(train_df, test_df):
"""
Creates a composite feature 'Embarked_HasCabin' by combining:
- Embarked (C, Q, S)
- HasCabin (as 0 or 1)
Example values: 'S_1', 'C_0'
Args:
train_df (pd.DataFrame): Training set.
test_df (pd.DataFrame): Test set.
Returns:
None (adds 'Embarked_HasCabin' column to both dataframes)
"""
def combine(embarked, has_cabin):
return f"{embarked}_{int(has_cabin)}"
for df in [train_df, test_df]:
df['Embarked_HasCabin'] = df.apply(
lambda row: combine(row['Embarked'], row['HasCabin']), axis=1
).astype(str)
print("Created 'Embarked_HasCabin' in train_df and test_df.")
create_feature_Embarked_HasCabin(prepared_train_df, prepared_test_df)
Created 'Embarked_HasCabin' in train_df and test_df.
prepared_train_df['Embarked_HasCabin'].value_counts()
Embarked_HasCabin
S_0    515
S_1    131
C_0     99
Q_0     73
C_1     69
Q_1      4
Name: count, dtype: int64
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['Embarked_HasCabin'])
display(styled)
| Feature | Avg_KL_Divergence | Min_KL_Divergence | Max_KL_Divergence | Std_KL_Divergence | Fold_1_KL | Fold_2_KL | Fold_3_KL | Fold_4_KL | Fold_5_KL | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Embarked_HasCabin | 0.028635 | 0.010627 | 0.050821 | 0.016619 | 0.014879 | 0.050821 | 0.046258 | 0.010627 | 0.020591 |
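The "smoothed feature translation" referred to above presumably means target encoding with global-mean smoothing (as described in the project summary). A minimal sketch under that assumption, where a rare category's raw mean is shrunk toward the global survival rate in proportion to its sample size (the function name and `k` value are illustrative):

```python
import pandas as pd

def smoothed_target_encode(train_df, col, target='Survived', k=10):
    """Map each category to a blend of its mean target and the global mean:
    encoding = (n * cat_mean + k * global_mean) / (n + k), where n is the
    category count. Small-n categories are pulled strongly toward the global mean."""
    global_mean = train_df[target].mean()
    stats = train_df.groupby(col)[target].agg(['mean', 'count'])
    smoothed = (stats['count'] * stats['mean'] + k * global_mean) / (stats['count'] + k)
    return smoothed  # Series indexed by category; apply with .map() on train/test

# Toy usage: the rare category (n=2, raw mean 1.0) shrinks heavily toward the global mean 0.4
toy = pd.DataFrame({'Embarked_HasCabin': ['S_0'] * 8 + ['Q_1'] * 2,
                    'Survived': [0] * 6 + [1] * 2 + [1] * 2})
enc = smoothed_target_encode(toy, 'Embarked_HasCabin', k=10)
```

Fitting the encoding per training fold, rather than on the full training set, would keep it consistent with the leakage precautions used for the KL analysis.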
Parch_SibSp_bin¶
Creating new Parch_SibSp_bin reduced Cross-Fold (CF) Distribution Shift to negligible levels (KL = 0.011754).
def create_feature_Parch_SibSp_bin(train_df, test_df):
"""
Creates a binned version of the 'Parch_SibSp' feature:
- Values >= 4 → '4+'
- All other values are converted to strings of their actual value
Args:
train_df (pd.DataFrame): Training set.
test_df (pd.DataFrame): Test set.
Returns:
None (adds 'Parch_SibSp_bin' column to both dataframes)
"""
def bin_value(x):
return '4+' if x >= 4 else str(x)
for df in [train_df, test_df]:
df['Parch_SibSp_bin'] = df['Parch_SibSp'].apply(bin_value).astype(str)
print("Created 'Parch_SibSp_bin' in train_df and test_df with '4+' bin for values >= 4.")
create_feature_Parch_SibSp_bin(prepared_train_df, prepared_test_df)
Created 'Parch_SibSp_bin' in train_df and test_df with '4+' bin for values >= 4.
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['Parch_SibSp_bin'])
display(styled)
| Feature | Avg_KL_Divergence | Min_KL_Divergence | Max_KL_Divergence | Std_KL_Divergence | Fold_1_KL | Fold_2_KL | Fold_3_KL | Fold_4_KL | Fold_5_KL | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Parch_SibSp_bin | 0.011754 | 0.006239 | 0.020432 | 0.004932 | 0.006239 | 0.020432 | 0.008108 | 0.010876 | 0.013114 |
HasCabin x Parch_SibSp_bin¶
Created HasCabin_Parch_SibSp_bin, binning Parch_SibSp at 3+; exhibited negligible CF distribution shift (KL = 0.019378).
def create_feature_HasCabin_Parch_SibSp_bin(train_df, test_df):
"""
Creates a composite feature 'HasCabin_Parch_SibSp_bin' by:
- Summing Parch + SibSp
- Binning the result into '0', '1', '2', or '3+'
- Concatenating with HasCabin (converted to 0 or 1)
Args:
train_df (pd.DataFrame): Training set.
test_df (pd.DataFrame): Test set.
Returns:
None (adds 'HasCabin_Parch_SibSp_bin' to both dataframes)
"""
def bin_family_size(n):
if n == 0:
return '0'
elif n == 1:
return '1'
elif n == 2:
return '2'
else:
return '3+'
def combine(has_cabin, parch, sibsp):
family_size_bin = bin_family_size(parch + sibsp)
return f"{int(has_cabin)}_{family_size_bin}"
for df in [train_df, test_df]:
df['HasCabin_Parch_SibSp_bin'] = df.apply(
lambda row: combine(row['HasCabin'], row['Parch'], row['SibSp']), axis=1
).astype(str)
print("Created 'HasCabin_Parch_SibSp_bin' in train_df and test_df.")
create_feature_HasCabin_Parch_SibSp_bin(prepared_train_df, prepared_test_df)
Created 'HasCabin_Parch_SibSp_bin' in train_df and test_df.
prepared_train_df['HasCabin_Parch_SibSp_bin'].value_counts()
HasCabin_Parch_SibSp_bin
0_0     443
0_1      95
1_0      94
0_3+     76
0_2      73
1_1      66
1_2      29
1_3+     15
Name: count, dtype: int64
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['HasCabin_Parch_SibSp_bin'])
display(styled)
| Feature | Avg_KL_Divergence | Min_KL_Divergence | Max_KL_Divergence | Std_KL_Divergence | Fold_1_KL | Fold_2_KL | Fold_3_KL | Fold_4_KL | Fold_5_KL | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HasCabin_Parch_SibSp_bin | 0.019378 | 0.006888 | 0.029617 | 0.008861 | 0.013843 | 0.029221 | 0.006888 | 0.029617 | 0.017321 |
Pclass x Parch_SibSp_bin¶
Created Pclass_Parch_SibSp_bin, binning Parch_SibSp at 1+; exhibited negligible CF distribution shift (KL = 0.013836).
def create_feature_Pclass_Parch_SibSp_bin(train_df, test_df):
"""
Creates a composite feature 'Pclass_Parch_SibSp_bin' by:
- Summing Parch + SibSp
- Binning the sum into '0' or '1+'
- Concatenating with Pclass
Args:
train_df (pd.DataFrame): Training set.
test_df (pd.DataFrame): Test set.
Returns:
None (adds 'Pclass_Parch_SibSp_bin' to both dataframes)
"""
def bin_family_size(n):
return '0' if n == 0 else '1+'
def combine(pclass, parch, sibsp):
family_size_bin = bin_family_size(parch + sibsp)
return f"{pclass}_{family_size_bin}"
for df in [train_df, test_df]:
df['Pclass_Parch_SibSp_bin'] = df.apply(
lambda row: combine(row['Pclass'], row['Parch'], row['SibSp']), axis=1
).astype(str)
print("Created 'Pclass_Parch_SibSp_bin' in train_df and test_df.")
create_feature_Pclass_Parch_SibSp_bin(prepared_train_df, prepared_test_df)
Created 'Pclass_Parch_SibSp_bin' in train_df and test_df.
prepared_train_df['Pclass_Parch_SibSp_bin'].value_counts()
Pclass_Parch_SibSp_bin
3_0     324
3_1+    167
1_0     109
1_1+    107
2_0     104
2_1+     80
Name: count, dtype: int64
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['Pclass_Parch_SibSp_bin'])
display(styled)
| Feature | Avg_KL_Divergence | Min_KL_Divergence | Max_KL_Divergence | Std_KL_Divergence | Fold_1_KL | Fold_2_KL | Fold_3_KL | Fold_4_KL | Fold_5_KL | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pclass_Parch_SibSp_bin | 0.013836 | 0.001541 | 0.024130 | 0.007736 | 0.013611 | 0.019396 | 0.024130 | 0.010499 | 0.001541 |
survival_df = (
prepared_train_df
.groupby("Pclass_Parch_SibSp_bin", observed=True)
.agg(Survival_Rate=('Survived', 'mean'),
Count=('Survived', 'size')
)
.reset_index()
.sort_values(by="Pclass_Parch_SibSp_bin", ascending=True)
)
survival_df
| Pclass_Parch_SibSp_bin | Survival_Rate | Count | |
|---|---|---|---|
| 0 | 1_0 | 0.532110 | 109 |
| 1 | 1_1+ | 0.728972 | 107 |
| 2 | 2_0 | 0.346154 | 104 |
| 3 | 2_1+ | 0.637500 | 80 |
| 4 | 3_0 | 0.212963 | 324 |
| 5 | 3_1+ | 0.299401 | 167 |
Sex x Parch_SibSp_bin¶
Created Sex_Parch_SibSp_bin, binning Parch_SibSp at 1+; exhibited negligible CF distribution shift (KL = 0.015822).
def create_feature_Sex_Parch_SibSp_bin(train_df, test_df):
"""
Creates a composite feature 'Sex_Parch_SibSp_bin' by:
- Summing Parch + SibSp
- Binning the sum into '0' or '1+'
- Concatenating with Sex (e.g., 'male_1+', 'female_0')
Args:
train_df (pd.DataFrame): Training set.
test_df (pd.DataFrame): Test set.
Returns:
None (adds 'Sex_Parch_SibSp_bin' to both dataframes)
"""
def bin_family_size(n):
return '0' if n == 0 else '1+'
def combine(sex, parch, sibsp):
family_size_bin = bin_family_size(parch + sibsp)
return f"{sex}_{family_size_bin}"
for df in [train_df, test_df]:
df['Sex_Parch_SibSp_bin'] = df.apply(
lambda row: combine(row['Sex'], row['Parch'], row['SibSp']), axis=1
).astype(str)
print("Created 'Sex_Parch_SibSp_bin' in train_df and test_df.")
create_feature_Sex_Parch_SibSp_bin(prepared_train_df, prepared_test_df)
Created 'Sex_Parch_SibSp_bin' in train_df and test_df.
prepared_train_df['Sex_Parch_SibSp_bin'].value_counts()
Sex_Parch_SibSp_bin
male_0       411
female_1+    188
male_1+      166
female_0     126
Name: count, dtype: int64
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['Sex_Parch_SibSp_bin'])
display(styled)
| Feature | Avg_KL_Divergence | Min_KL_Divergence | Max_KL_Divergence | Std_KL_Divergence | Fold_1_KL | Fold_2_KL | Fold_3_KL | Fold_4_KL | Fold_5_KL | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Sex_Parch_SibSp_bin | 0.015822 | 0.000836 | 0.033512 | 0.012422 | 0.000836 | 0.033512 | 0.027176 | 0.007020 | 0.010564 |
survival_df = (
prepared_train_df
.groupby("Sex_Parch_SibSp_bin", observed=True)
.agg(Survival_Rate=('Survived', 'mean'),
Count=('Survived', 'size')
)
.reset_index()
.sort_values(by="Sex_Parch_SibSp_bin", ascending=True)
)
survival_df
| Sex_Parch_SibSp_bin | Survival_Rate | Count | |
|---|---|---|---|
| 0 | female_0 | 0.785714 | 126 |
| 1 | female_1+ | 0.712766 | 188 |
| 2 | male_0 | 0.155718 | 411 |
| 3 | male_1+ | 0.271084 | 166 |
Pclass x Embarked¶
Created (n < 20)-binned Pclass_Embarked - exhibited negligible CF distribution shift (KL = 0.018094).
def create_feature_Pclass_Embarked(train_df, test_df):
"""
Creates a composite feature 'Pclass_Embarked' by combining:
- Pclass (as an integer)
- Embarked (as a character)
Any combination that appears fewer than 20 times in the training set
is binned into the 'Rare' category.
Args:
train_df (pd.DataFrame): Training set.
test_df (pd.DataFrame): Test set.
Returns:
None (adds 'Pclass_Embarked' column to both dataframes)
"""
def combine(pclass, embarked):
return f"{pclass}_{embarked}"
# Build composite keys for the training set
train_keys = train_df.apply(lambda row: combine(row['Pclass'], row['Embarked']), axis=1)
# Count occurrences and identify common groups
value_counts = train_keys.value_counts()
common_groups = value_counts[value_counts >= 20].index
def assign_or_rare(pclass, embarked):
key = f"{pclass}_{embarked}"
return key if key in common_groups else 'Rare'
for df in [train_df, test_df]:
df['Pclass_Embarked'] = df.apply(
lambda row: assign_or_rare(row['Pclass'], row['Embarked']), axis=1
).astype(str)
print("Created 'Pclass_Embarked' in train_df and test_df (groups < 20 binned to 'Rare').")
create_feature_Pclass_Embarked(prepared_train_df, prepared_test_df)
Created 'Pclass_Embarked' in train_df and test_df (groups < 20 binned to 'Rare').
prepared_train_df['Pclass_Embarked'].value_counts()
Pclass_Embarked
3_S     353
2_S     164
1_S     129
1_C      85
3_Q      72
3_C      66
Rare     22
Name: count, dtype: int64
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['Pclass_Embarked'])
display(styled)
| Feature | Avg_KL_Divergence | Min_KL_Divergence | Max_KL_Divergence | Std_KL_Divergence | Fold_1_KL | Fold_2_KL | Fold_3_KL | Fold_4_KL | Fold_5_KL | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pclass_Embarked | 0.018094 | 0.011404 | 0.029573 | 0.007008 | 0.029573 | 0.011629 | 0.015340 | 0.022525 | 0.011404 |
Sex x Embarked¶
Created (n < 30)-binned Sex_Embarked - exhibited negligible CF distribution shift (KL = 0.018854).
def create_feature_Sex_Embarked(train_df, test_df):
"""
Creates a composite feature 'Sex_Embarked' by combining:
- Sex (e.g., 'male', 'female')
- Embarked (e.g., 'C', 'Q', 'S')
Any combination that appears fewer than 30 times in the training set
is binned into the 'Rare' category.
Args:
train_df (pd.DataFrame): Training set.
test_df (pd.DataFrame): Test set.
Returns:
None (adds 'Sex_Embarked' column to both dataframes)
"""
def combine(sex, embarked):
return f"{sex}_{embarked}"
# Build composite keys in the training set
train_keys = train_df.apply(lambda row: combine(row['Sex'], row['Embarked']), axis=1)
# Count and identify common combinations
value_counts = train_keys.value_counts()
common_groups = value_counts[value_counts >= 30].index
def assign_or_rare(sex, embarked):
key = f"{sex}_{embarked}"
return key if key in common_groups else 'Rare'
for df in [train_df, test_df]:
df['Sex_Embarked'] = df.apply(
lambda row: assign_or_rare(row['Sex'], row['Embarked']), axis=1
).astype(str)
print("Created 'Sex_Embarked' in train_df and test_df (groups < 30 binned to 'Rare').")
create_feature_Sex_Embarked(prepared_train_df, prepared_test_df)
Created 'Sex_Embarked' in train_df and test_df (groups < 30 binned to 'Rare').
prepared_train_df['Sex_Embarked'].value_counts()
Sex_Embarked
male_S      441
female_S    205
male_C       95
female_C     73
male_Q       41
female_Q     36
Name: count, dtype: int64
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['Sex_Embarked'])
display(styled)
| Feature | Avg_KL_Divergence | Min_KL_Divergence | Max_KL_Divergence | Std_KL_Divergence | Fold_1_KL | Fold_2_KL | Fold_3_KL | Fold_4_KL | Fold_5_KL | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Sex_Embarked | 0.018854 | 0.011960 | 0.029482 | 0.006015 | 0.017022 | 0.011960 | 0.029482 | 0.020665 | 0.015141 |
Pclass x Deck_bin¶
- Created (n < 10)-binned Pclass_Deck_bin - exhibited negligible CF distribution shift (KL = 0.013757).
- Decks BDE and AM were binned together due to similar survival rates in EDA.
def create_feature_Pclass_Deck_bin(train_df, test_df):
"""
Creates a composite feature 'Pclass_Deck_bin' by:
- Binning Deck into: 'BDE', 'AM', or 'Other'
- Concatenating with Pclass
- Binning combinations with < 10 occurrences into 'Rare'
Args:
train_df (pd.DataFrame): Training set.
test_df (pd.DataFrame): Test set.
Returns:
None (adds 'Pclass_Deck_bin' to both dataframes)
"""
def bin_deck(deck):
if deck in ['B', 'D', 'E']:
return 'BDE'
elif deck in ['A', 'M']:
return 'AM'
else:
return 'Other'
def combine(pclass, deck):
return f"{pclass}_{bin_deck(deck)}"
# Build composite keys from training set
train_keys = train_df.apply(lambda row: combine(row['Pclass'], row['Deck']), axis=1)
value_counts = train_keys.value_counts()
common_groups = value_counts[value_counts >= 10].index
def assign_or_rare(pclass, deck):
key = f"{pclass}_{bin_deck(deck)}"
return key if key in common_groups else "Rare"
for df in [train_df, test_df]:
df['Pclass_Deck_bin'] = df.apply(
lambda row: assign_or_rare(row['Pclass'], row['Deck']), axis=1
).astype(str)
print("Created 'Pclass_Deck_bin' in train_df and test_df (groups < 10 binned to 'Rare').")
create_feature_Pclass_Deck_bin(prepared_train_df, prepared_test_df)
Created 'Pclass_Deck_bin' in train_df and test_df (groups < 10 binned to 'Rare').
prepared_train_df['Pclass_Deck_bin'].value_counts()
Pclass_Deck_bin
3_AM       479
2_AM       168
1_BDE      101
1_Other     60
1_AM        55
Rare        28
Name: count, dtype: int64
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['Pclass_Deck_bin'])
display(styled)
| Feature | Avg_KL_Divergence | Min_KL_Divergence | Max_KL_Divergence | Std_KL_Divergence | Fold_1_KL | Fold_2_KL | Fold_3_KL | Fold_4_KL | Fold_5_KL | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pclass_Deck_bin | 0.013757 | 0.006021 | 0.021299 | 0.005887 | 0.021299 | 0.007718 | 0.017449 | 0.016297 | 0.006021 |
Pclass x Cabin_Location_s¶
Creating (n < 10)-binned Pclass_Cabin_Location_s reduced distribution shift to negligible levels (KL = 0.017970).
def create_feature_Pclass_Cabin_Location_s(train_df, test_df):
"""
Creates a composite feature 'Pclass_Cabin_Location_s' by combining:
- Pclass (1, 2, 3)
- Cabin_Location_s (e.g., 'port', 'starboard', 'unknown')
Groups with fewer than 10 occurrences in the training set are binned into 'Rare'.
Args:
train_df (pd.DataFrame): Training set.
test_df (pd.DataFrame): Test set.
Returns:
None (adds 'Pclass_Cabin_Location_s' column to both dataframes)
"""
def combine(pclass, cabin_location):
return f"{pclass}_{cabin_location}"
# Compute composite keys in train
train_keys = train_df.apply(lambda row: combine(row['Pclass'], row['Cabin_Location_s']), axis=1)
value_counts = train_keys.value_counts()
common_groups = value_counts[value_counts >= 10].index
def assign_or_rare(pclass, cabin_location):
key = f"{pclass}_{cabin_location}"
return key if key in common_groups else "Rare"
for df in [train_df, test_df]:
df['Pclass_Cabin_Location_s'] = df.apply(
lambda row: assign_or_rare(row['Pclass'], row['Cabin_Location_s']), axis=1
).astype(str)
print("Created 'Pclass_Cabin_Location_s' in train_df and test_df (groups < 10 binned to 'Rare').")
create_feature_Pclass_Cabin_Location_s(prepared_train_df, prepared_test_df)
Created 'Pclass_Cabin_Location_s' in train_df and test_df (groups < 10 binned to 'Rare').
prepared_train_df['Pclass_Cabin_Location_s'].value_counts()
Pclass_Cabin_Location_s 3_no_cabin_info 479 2_no_cabin_info 168 1_port 96 1_starboard 77 1_no_cabin_info 40 Rare 31 Name: count, dtype: int64
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['Pclass_Cabin_Location_s'])
display(styled)
| Feature | Avg_KL_Divergence | Min_KL_Divergence | Max_KL_Divergence | Std_KL_Divergence | Fold_1_KL | Fold_2_KL | Fold_3_KL | Fold_4_KL | Fold_5_KL | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pclass_Cabin_Location_s | 0.017970 | 0.006337 | 0.035650 | 0.011163 | 0.026185 | 0.006337 | 0.035650 | 0.012567 | 0.009113 |
Pclass x Normalized Title¶
Created Pclass_Title_normalized - Exhibited CF distribution shift (KL = 0.025850), driven by the low-sample 12_Master category. Keeping it for now to potentially benefit from the group; will ablation-test during the Model Development phase.
def create_feature_Pclass_Title_normalized(train_df, test_df):
"""
Creates a composite feature 'Pclass_Title_normalized' by:
- Normalizing Title based on Sex, SibSp, and Age
- Concatenating with Pclass
- Merging '1_Master' and '2_Master' into '12_Master'
Args:
train_df (pd.DataFrame): Training set.
test_df (pd.DataFrame): Test set.
Returns:
None (adds 'Pclass_Title_normalized' column to both dataframes)
"""
def normalize_title(row):
title = row['Title']
sex = row['Sex']
sibsp = row['SibSp']
age = row['Age']
if title in ['Mr', 'Mrs', 'Miss', 'Master']:
return title
if sex == 'male':
if pd.notna(age) and age < 14:
return 'Master'
return 'Mr'
else:
return 'Mrs' if sibsp > 0 else 'Miss'
def assign_pclass_title(row):
pclass = row['Pclass']
title = row['Normalized_Title']
if title == 'Master' and pclass in [1, 2]:
return '12_Master'
return f"{pclass}_{title}"
for df in [train_df, test_df]:
df['Normalized_Title'] = df.apply(normalize_title, axis=1)
df['Pclass_Title_normalized'] = df.apply(assign_pclass_title, axis=1).astype(str)
print("Created 'Pclass_Title_normalized' in train_df and test_df with '12_Master' merged.")
create_feature_Pclass_Title_normalized(prepared_train_df, prepared_test_df)
Created 'Pclass_Title_normalized' in train_df and test_df with '12_Master' merged.
prepared_train_df['Pclass_Title_normalized'].value_counts()
Pclass_Title_normalized 3_Mr 318 1_Mr 119 2_Mr 99 3_Miss 81 3_Mrs 63 1_Miss 49 1_Mrs 45 2_Miss 44 2_Mrs 32 3_Master 29 12_Master 12 Name: count, dtype: int64
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['Pclass_Title_normalized'])
display(styled)
| Feature | Avg_KL_Divergence | Min_KL_Divergence | Max_KL_Divergence | Std_KL_Divergence | Fold_1_KL | Fold_2_KL | Fold_3_KL | Fold_4_KL | Fold_5_KL | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pclass_Title_normalized | 0.025850 | 0.010231 | 0.039694 | 0.011595 | 0.010231 | 0.039694 | 0.038210 | 0.024290 | 0.016827 |
Deck_bin¶
- Created Deck_bin - Exhibited negligible CF distribution shift (KL = 0.005735).
- Decks B/D/E and A/M were binned together (as BDE and AM) due to similar survival rates in EDA.
def create_feature_Deck_bin(train_df, test_df):
"""
Creates a binned feature 'Deck_bin' from the 'Deck' column:
- 'BDE' if Deck is B, D, or E
- 'AM' if Deck is A or M
- 'Other' for all other values (including NaN)
Args:
train_df (pd.DataFrame): Training set.
test_df (pd.DataFrame): Test set.
Returns:
None (adds 'Deck_bin' to both dataframes)
"""
def bin_deck(deck):
if deck in ['B', 'D', 'E']:
return 'BDE'
elif deck in ['A', 'M']:
return 'AM'
else:
return 'Other'
for df in [train_df, test_df]:
df['Deck_bin'] = df['Deck'].apply(bin_deck).astype(str)
print("Created 'Deck_bin' in train_df and test_df.")
create_feature_Deck_bin(prepared_train_df, prepared_test_df)
Created 'Deck_bin' in train_df and test_df.
prepared_train_df['Deck_bin'].value_counts()
Deck_bin AM 702 BDE 112 Other 77 Name: count, dtype: int64
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['Deck_bin'])
display(styled)
| Feature | Avg_KL_Divergence | Min_KL_Divergence | Max_KL_Divergence | Std_KL_Divergence | Fold_1_KL | Fold_2_KL | Fold_3_KL | Fold_4_KL | Fold_5_KL | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Deck_bin | 0.005735 | 0.000509 | 0.021163 | 0.007751 | 0.001993 | 0.002214 | 0.000509 | 0.021163 | 0.002795 |
survival_df = (
prepared_train_df
.groupby("Deck_bin", observed=True)
.agg(Survival_Rate=('Survived', 'mean'),
Count=('Survived', 'size')
)
.reset_index()
.sort_values(by="Deck_bin", ascending=True)
)
survival_df
| Deck_bin | Survival_Rate | Count | |
|---|---|---|---|
| 0 | AM | 0.303419 | 702 |
| 1 | BDE | 0.750000 | 112 |
| 2 | Other | 0.584416 | 77 |
Title_normalized¶
- Created Title_normalized - Exhibited negligible CF distribution shift (KL = 0.011534).
- Rare titles are merged into the Mr/Mrs/Miss/Master groups based on Sex, SibSp, and Age.
def create_feature_Title_normalized(train_df, test_df):
"""
Creates a normalized title feature 'Title_normalized' using:
- Original Title
- Sex
- SibSp
- Age
Logic:
- Keep 'Mr', 'Mrs', 'Miss', and 'Master' as-is
- For all other titles:
- If male and Age < 14 → 'Master'
- If male → 'Mr'
- If female and SibSp > 0 → 'Mrs'
- If female and SibSp == 0 → 'Miss'
Args:
train_df (pd.DataFrame): Training set.
test_df (pd.DataFrame): Test set.
Returns:
None (adds 'Title_normalized' to both dataframes)
"""
def normalize_title(row):
title = row['Title']
sex = row['Sex']
sibsp = row['SibSp']
age = row['Age']
if title in ['Mr', 'Mrs', 'Miss', 'Master']:
return title
if sex == 'male':
if pd.notna(age) and age < 14:
return 'Master'
return 'Mr'
else:
return 'Mrs' if sibsp > 0 else 'Miss'
for df in [train_df, test_df]:
df['Title_normalized'] = df.apply(normalize_title, axis=1).astype(str)
print("Created 'Title_normalized' in train_df and test_df.")
create_feature_Title_normalized(prepared_train_df, prepared_test_df)
Created 'Title_normalized' in train_df and test_df.
prepared_train_df['Title_normalized'].value_counts()
Title_normalized Mr 536 Miss 174 Mrs 140 Master 41 Name: count, dtype: int64
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, ['Title_normalized'])
display(styled)
| Feature | Avg_KL_Divergence | Min_KL_Divergence | Max_KL_Divergence | Std_KL_Divergence | Fold_1_KL | Fold_2_KL | Fold_3_KL | Fold_4_KL | Fold_5_KL | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Title_normalized | 0.011534 | 0.001920 | 0.025101 | 0.008669 | 0.001920 | 0.025101 | 0.018127 | 0.006011 | 0.006509 |
Pclass_Sex One-Hot Encodings¶
def create_Pclass_Sex_one_hot_encodings(train_df, test_df):
for df in [train_df, test_df]:
dummies = pd.get_dummies(df['Pclass_Sex'], prefix='Pclass_Sex')
df[dummies.columns] = dummies
        print(f"{len(dummies.columns)} one-hot encodings created for Pclass x Sex: {dummies.columns}")
return dummies.columns
pclass_sex_oh_cols = create_Pclass_Sex_one_hot_encodings(train_df, test_df)
6 one-hot encodings created for Pclass x Sex: Index(['Pclass_Sex_1_female', 'Pclass_Sex_1_male', 'Pclass_Sex_2_female',
'Pclass_Sex_2_male', 'Pclass_Sex_3_female', 'Pclass_Sex_3_male'],
dtype='object')
6 one-hot encodings created for Pclass x Sex: Index(['Pclass_Sex_1_female', 'Pclass_Sex_1_male', 'Pclass_Sex_2_female',
'Pclass_Sex_2_male', 'Pclass_Sex_3_female', 'Pclass_Sex_3_male'],
dtype='object')
Negligible distribution shift for all created Pclass x Sex one-hot encodings (all KL < 0.02)
styled, _ = evaluate_feature_kl_divergence(prepared_train_df, pclass_sex_oh_cols)
display(styled)
| Feature | Avg_KL_Divergence | Min_KL_Divergence | Max_KL_Divergence | Std_KL_Divergence | Fold_1_KL | Fold_2_KL | Fold_3_KL | Fold_4_KL | Fold_5_KL | |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Pclass_Sex_2_male | 0.003808 | 0.000597 | 0.013918 | 0.005155 | 0.000688 | 0.000597 | 0.013918 | 0.000597 | 0.003240 |
| 0 | Pclass_Sex_1_female | 0.002241 | 0.000013 | 0.004184 | 0.001328 | 0.002324 | 0.002189 | 0.004184 | 0.002497 | 0.000013 |
| 4 | Pclass_Sex_3_female | 0.002012 | 0.000001 | 0.006726 | 0.002623 | 0.000001 | 0.000271 | 0.006726 | 0.000010 | 0.003055 |
| 1 | Pclass_Sex_1_male | 0.001723 | 0.000029 | 0.004394 | 0.001594 | 0.001223 | 0.002564 | 0.000029 | 0.004394 | 0.000404 |
| 2 | Pclass_Sex_2_female | 0.001264 | 0.000205 | 0.003621 | 0.001215 | 0.000527 | 0.003621 | 0.000205 | 0.000982 | 0.000982 |
| 5 | Pclass_Sex_3_male | 0.000355 | 0.000048 | 0.000762 | 0.000326 | 0.000762 | 0.000738 | 0.000048 | 0.000048 | 0.000182 |
Survival Association Tests¶
- Chi-squared tests are run against selected global and pclass_sex subgrouped features to determine which have a statistically-significant association with Survival Rate (p < 0.05).
- Features are then sorted by descending Cramer's V value (strength of association) to prioritize testing during Model Development.
def chi2_test_features_against_survival_with_cramers_v(df, feature_list, target_col='Survived', alpha=0.05):
"""
Perform chi-squared tests between a list of categorical features and the target column (Survived).
Also calculates Cramér's V to indicate the strength of the association.
Args:
df (pd.DataFrame): DataFrame containing the data.
feature_list (list): List of feature column names to test.
target_col (str): The target column to test association with.
alpha (float): Significance level for determining statistical significance.
Returns:
pd.DataFrame: DataFrame summarizing chi-squared test results with Cramér's V, sorted by descending Cramér's V.
"""
results = []
for feature in feature_list:
contingency = pd.crosstab(df[feature], df[target_col])
n = contingency.sum().sum()
if contingency.shape[0] < 2 or contingency.shape[1] < 2:
results.append({
'Feature': feature,
'Chi2 Statistic': np.nan,
'p-value': np.nan,
'Cramer\'s V': np.nan,
'Significant': False
})
continue
chi2, p, dof, expected = chi2_contingency(contingency)
k = min(contingency.shape)
cramers_v = np.sqrt(chi2 / (n * (k - 1))) if k > 1 else np.nan
results.append({
'Feature': feature,
'Chi2 Statistic': chi2,
'p-value': p,
'Cramer\'s V': cramers_v,
'Significant': p < alpha
})
results_df = pd.DataFrame(results).sort_values(by="Cramer\'s V", ascending=False)
# Style output to highlight statistically significant rows in green
def highlight_significant(row):
color = 'background-color: lightgreen' if row['Significant'] else ''
return [color] * len(row)
styled = results_df.style.apply(highlight_significant, axis=1)
display(styled)
return results_df
Global Feature Survival Association Tests¶
global_features_to_eval = [
'Pclass_HasCabin',
'Sex_HasCabin',
'Embarked_HasCabin',
'Parch_SibSp_bin',
'HasCabin_Parch_SibSp_bin',
'Pclass_Parch_SibSp_bin',
'Sex_Parch_SibSp_bin',
'Pclass_Embarked',
'Sex_Embarked',
'Pclass_Deck_bin',
'Pclass_Cabin_Location_s',
'Pclass_Title_normalized',
'Deck_bin',
'Title_normalized',
'Pclass_Sex_1_female',
'Pclass_Sex_1_male',
'Pclass_Sex_2_female',
'Pclass_Sex_2_male',
'Pclass_Sex_3_female',
'Pclass_Sex_3_male',
'Pclass_Sex'
]
# Run chi-squared + Cramér's V analysis
results_df = chi2_test_features_against_survival_with_cramers_v(prepared_train_df, global_features_to_eval)
| Feature | Chi2 Statistic | p-value | Cramer's V | Significant | |
|---|---|---|---|---|---|
| 11 | Pclass_Title_normalized | 400.105514 | 0.000000 | 0.670114 | True |
| 20 | Pclass_Sex | 350.675308 | 0.000000 | 0.627356 | True |
| 1 | Sex_HasCabin | 315.679272 | 0.000000 | 0.595229 | True |
| 13 | Title_normalized | 292.273628 | 0.000000 | 0.572738 | True |
| 8 | Sex_Embarked | 278.911706 | 0.000000 | 0.559493 | True |
| 6 | Sex_Parch_SibSp_bin | 271.402109 | 0.000000 | 0.551909 | True |
| 14 | Pclass_Sex_1_female | 148.919875 | 0.000000 | 0.408825 | True |
| 19 | Pclass_Sex_3_male | 146.550069 | 0.000000 | 0.405559 | True |
| 5 | Pclass_Parch_SibSp_bin | 131.446775 | 0.000000 | 0.384093 | True |
| 9 | Pclass_Deck_bin | 123.775185 | 0.000000 | 0.372716 | True |
| 4 | HasCabin_Parch_SibSp_bin | 121.671640 | 0.000000 | 0.369535 | True |
| 7 | Pclass_Embarked | 120.638493 | 0.000000 | 0.367963 | True |
| 10 | Pclass_Cabin_Location_s | 120.366297 | 0.000000 | 0.367548 | True |
| 0 | Pclass_HasCabin | 117.021729 | 0.000000 | 0.362405 | True |
| 2 | Embarked_HasCabin | 103.202699 | 0.000000 | 0.340335 | True |
| 16 | Pclass_Sex_2_female | 98.919730 | 0.000000 | 0.333198 | True |
| 12 | Deck_bin | 95.786717 | 0.000000 | 0.327879 | True |
| 3 | Parch_SibSp_bin | 77.587742 | 0.000000 | 0.295092 | True |
| 17 | Pclass_Sex_2_male | 25.563777 | 0.000000 | 0.169384 | True |
| 18 | Pclass_Sex_3_female | 9.222372 | 0.002391 | 0.101738 | True |
| 15 | Pclass_Sex_1_male | 0.070848 | 0.790106 | 0.008917 | False |
Pclass x Sex Subgroup Feature Survival Association Tests¶
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency
from IPython.display import display
def cramers_v_stat(chi2, n, k):
return np.sqrt(chi2 / (n * (k - 1))) if k > 1 else np.nan
def chi2_test_features_by_pclass_sex(df, feature_list, target_col='Survived', alpha=0.05):
"""
Perform chi-squared tests and Cramér's V for each feature within each Pclass x Sex subgroup.
Args:
df (pd.DataFrame): DataFrame containing features and target.
feature_list (list): List of categorical feature column names to evaluate.
target_col (str): Name of the binary target column. Default is 'Survived'.
alpha (float): Significance level. Default is 0.05.
Returns:
pd.DataFrame: Styled DataFrame sorted by Pclass, Sex, and descending Cramér's V.
"""
results = []
for pclass in sorted(df['Pclass'].dropna().unique()):
for sex in sorted(df['Sex'].dropna().unique()):
subgroup_df = df[(df['Pclass'] == pclass) & (df['Sex'] == sex)]
for feature in feature_list:
contingency = pd.crosstab(subgroup_df[feature], subgroup_df[target_col])
n = contingency.sum().sum()
if contingency.shape[0] < 2 or contingency.shape[1] < 2:
results.append({
'Feature': feature,
'Pclass': pclass,
'Sex': sex,
'Chi2 Statistic': np.nan,
'p-value': np.nan,
'Cramer\'s V': np.nan,
'Significant': False
})
continue
chi2, p, dof, _ = chi2_contingency(contingency)
k = min(contingency.shape)
v = cramers_v_stat(chi2, n, k)
results.append({
'Feature': feature,
'Pclass': pclass,
'Sex': sex,
'Chi2 Statistic': chi2,
'p-value': p,
'Cramer\'s V': v,
'Significant': p < alpha
})
results_df = pd.DataFrame(results).sort_values(
by=["Pclass", "Sex", "Cramer\'s V"],
ascending=[True, True, False]
)
def highlight_significant(row):
return ['background-color: lightgreen' if row['Significant'] else '' for _ in row]
styled = results_df.style.apply(highlight_significant, axis=1)
display(styled)
return results_df
pclass_sex_subgroup_features_to_eval = [
'Parch_SibSp_bin',
'Embarked',
'HasCabin',
'Cabin_Location_s',
'Deck_bin',
'Title_normalized',
'Age_Group',
'FPP_log_bin'
]
results_df = chi2_test_features_by_pclass_sex(prepared_train_df, pclass_sex_subgroup_features_to_eval)
| Feature | Pclass | Sex | Chi2 Statistic | p-value | Cramer's V | Significant | |
|---|---|---|---|---|---|---|---|
| 6 | Age_Group | 1 | female | 31.269190 | 0.000003 | 0.576759 | True |
| 0 | Parch_SibSp_bin | 1 | female | 30.219349 | 0.000004 | 0.566994 | True |
| 4 | Deck_bin | 1 | female | 7.689866 | 0.021388 | 0.286019 | True |
| 3 | Cabin_Location_s | 1 | female | 1.170785 | 0.760020 | 0.111603 | False |
| 1 | Embarked | 1 | female | 0.243108 | 0.885543 | 0.050855 | False |
| 7 | FPP_log_bin | 1 | female | 0.128096 | 0.720414 | 0.036915 | False |
| 5 | Title_normalized | 1 | female | 0.005622 | 0.940233 | 0.007733 | False |
| 2 | HasCabin | 1 | female | 0.000000 | 1.000000 | 0.000000 | False |
| 14 | Age_Group | 1 | male | 10.257159 | 0.036312 | 0.289957 | True |
| 8 | Parch_SibSp_bin | 1 | male | 7.099910 | 0.130702 | 0.241238 | False |
| 15 | FPP_log_bin | 1 | male | 4.541498 | 0.103235 | 0.192939 | False |
| 11 | Cabin_Location_s | 1 | male | 4.361506 | 0.224981 | 0.189077 | False |
| 12 | Deck_bin | 1 | male | 2.851519 | 0.240326 | 0.152883 | False |
| 13 | Title_normalized | 1 | male | 2.850271 | 0.091359 | 0.152849 | False |
| 10 | HasCabin | 1 | male | 2.444523 | 0.117936 | 0.141552 | False |
| 9 | Embarked | 1 | male | 0.887638 | 0.641582 | 0.085298 | False |
| 23 | FPP_log_bin | 2 | female | 3.725582 | 0.444416 | 0.221406 | False |
| 16 | Parch_SibSp_bin | 2 | female | 1.231122 | 0.872948 | 0.127275 | False |
| 22 | Age_Group | 2 | female | 1.228990 | 0.746060 | 0.127165 | False |
| 20 | Deck_bin | 2 | female | 0.987013 | 0.610482 | 0.113961 | False |
| 17 | Embarked | 2 | female | 0.875053 | 0.645631 | 0.107303 | False |
| 19 | Cabin_Location_s | 2 | female | 0.659575 | 0.882668 | 0.093159 | False |
| 18 | HasCabin | 2 | female | 0.000000 | 1.000000 | 0.000000 | False |
| 21 | Title_normalized | 2 | female | 0.000000 | 1.000000 | 0.000000 | False |
| 30 | Age_Group | 2 | male | 49.511789 | 0.000000 | 0.677084 | True |
| 29 | Title_normalized | 2 | male | 45.854146 | 0.000000 | 0.651595 | True |
| 27 | Cabin_Location_s | 2 | male | 16.443728 | 0.000269 | 0.390201 | True |
| 24 | Parch_SibSp_bin | 2 | male | 15.727944 | 0.001289 | 0.381614 | True |
| 28 | Deck_bin | 2 | male | 13.050838 | 0.001466 | 0.347622 | True |
| 26 | HasCabin | 2 | male | 8.689608 | 0.003200 | 0.283654 | True |
| 31 | FPP_log_bin | 2 | male | 3.969421 | 0.264785 | 0.191713 | False |
| 25 | Embarked | 2 | male | 0.329199 | 0.848234 | 0.055210 | False |
| 32 | Parch_SibSp_bin | 3 | female | 22.482968 | 0.000161 | 0.395135 | True |
| 33 | Embarked | 3 | female | 14.448617 | 0.000729 | 0.316761 | True |
| 39 | FPP_log_bin | 3 | female | 7.657619 | 0.104956 | 0.230603 | False |
| 38 | Age_Group | 3 | female | 7.165236 | 0.127410 | 0.223066 | False |
| 37 | Title_normalized | 3 | female | 5.530864 | 0.018684 | 0.195982 | True |
| 35 | Cabin_Location_s | 3 | female | 2.028986 | 0.362586 | 0.118702 | False |
| 36 | Deck_bin | 3 | female | 1.228986 | 0.540915 | 0.092383 | False |
| 34 | HasCabin | 3 | female | 0.173913 | 0.676657 | 0.034752 | False |
| 45 | Title_normalized | 3 | male | 13.878592 | 0.000195 | 0.199990 | True |
| 44 | Deck_bin | 3 | male | 13.427928 | 0.001214 | 0.196716 | True |
| 46 | Age_Group | 3 | male | 12.709396 | 0.012787 | 0.191381 | True |
| 40 | Parch_SibSp_bin | 3 | male | 11.409147 | 0.022331 | 0.181327 | True |
| 47 | FPP_log_bin | 3 | male | 8.616516 | 0.071433 | 0.157580 | False |
| 41 | Embarked | 3 | male | 4.719183 | 0.094459 | 0.116619 | False |
| 43 | Cabin_Location_s | 3 | male | 2.753371 | 0.252414 | 0.089077 | False |
| 42 | HasCabin | 3 | male | 0.684198 | 0.408145 | 0.044404 | False |
Survival Association Test Strategy and Results¶
Strategy:
- Chi-squared tests were used to confirm the statistical significance of each feature's association with survival (p < 0.05), both globally and within Pclass x Sex subgroups.
- Cramer's V was calculated for each feature to sort features in descending order of association strength.
- Features were split into two categories:
- "Global": Features informing rules shared across Pclass x Sex subgroups.
- "Pclass x Sex Subgroup": Features informing rules constrained to one Pclass x Sex subgroup (e.g. P1 Males, P3 Females).
Test Results:
- The following features will be used as the bases for smoothed survival rate features for modeling (sorted by descending Cramer's V):
  - Global Features: Pclass_Title_normalized, Pclass_Sex, Sex_HasCabin, Title_normalized, Sex_Embarked, Sex_Parch_SibSp_bin, Pclass_Sex_1_female, Pclass_Sex_3_male, Pclass_Parch_SibSp_bin, Pclass_Deck_bin, HasCabin_Parch_SibSp_bin, Pclass_Embarked, Pclass_Cabin_Location_s, Pclass_HasCabin, Embarked_HasCabin, Pclass_Sex_2_female, Deck_bin, Parch_SibSp_bin, Pclass_Sex_2_male, Pclass_Sex_3_female
  - Pclass x Sex Subgroup Features:
    - Pclass 1, Sex female: Age_Group, Parch_SibSp_bin, Deck_bin
    - Pclass 1, Sex male: Age_Group
    - Pclass 2, Sex male: Age_Group, Title_normalized, Cabin_Location_s, Parch_SibSp_bin, Deck_bin, HasCabin
    - Pclass 3, Sex female: Parch_SibSp_bin, Embarked, Age_Group, Title_normalized
    - Pclass 3, Sex male: Age_Group, Title_normalized, Deck_bin, Parch_SibSp_bin
Smoothed Survival Rate Feature Engineering¶
- Smoothed Survival Rate Features ("Smoothed Features") are target-encoded features that use the survival rates observed in the training set data for particular subgroups to inform the survival prediction of the same subgroups in the submission data set.
- To mitigate data leakage, the following actions are performed:
- No information from the test set is ever used in calculating group-level or global survival statistics. All smoothed values for the test set are computed exclusively using the full training set.
- When preparing training data set smoothed features for cross-validation, the full training data set is never used for any calculations of group or global means.
- The means can only be calculated using training fold data
- The calculated smoothed rates are only applied to the validation fold data
- When preparing submission test data set smoothed features, only the full training data set is used.
- Smoothed rates are calculated using a Bayesian adjustment formula to balance subgroup means with the overall global mean, taking into account low-sample subgroups.
- For training data set cross-validation preparation:
grouped['smoothed'] = ((grouped['group_mean'] * grouped['group_count'] + prior * fold_global_mean) /(grouped['group_count'] + prior))
- For submission test data set preparation:
grouped_full['smoothed'] = ((grouped_full['group_mean'] * grouped_full['group_count'] + prior * full_global_mean) /(grouped_full['group_count'] + prior))
- For training data set cross-validation preparation:
- Subgroup sample sizes less than 10 are excluded from the smoothed rate calculation.
- A relevance mask is applied to each smoothed feature to zero-out its values for passengers that do not match the feature's pclass_sex subgroup.
- This ensures that the model only learns from smoothed features that are relevant to the passenger’s actual subgroup.
- Example: A smoothed feature for Pclass=1, Sex=female is set to 0 for a male passenger in Pclass=3.
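The Bayesian adjustment formula above can be checked with a small worked example (the numbers here are illustrative, not taken from the dataset):

```python
def smoothed_rate(group_mean, group_count, global_mean, prior=10):
    """Blend a subgroup's observed rate with the global mean;
    smaller groups are pulled harder toward the global mean."""
    return (group_mean * group_count + prior * global_mean) / (group_count + prior)

# A 30-passenger subgroup with a 0.90 observed rate and a 0.38 global mean:
# (0.90*30 + 10*0.38) / (30 + 10) = 30.8 / 40 ≈ 0.77
print(smoothed_rate(0.90, 30, 0.38))

# The same 0.90 rate from only 3 passengers is pulled much closer to 0.38:
# (0.90*3 + 10*0.38) / (3 + 10) = 6.5 / 13 = 0.50
print(smoothed_rate(0.90, 3, 0.38))
```

With prior = 10, a subgroup's influence only dominates once its sample size is well above the prior, which is why the n < 10 groups are excluded entirely.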
# For use in creating leakage-free standalone smoothed rate features (e.g. Title_bin_smoothed)
def generate_global_smoothed_feature(train_df, target_col, group_col,
test_df=None, prior=10, feature_name=None,
n_splits=5, random_state=42):
"""
Create a globally smoothed target encoding for a categorical feature using CV-based out-of-fold encoding.
Prevents leakage by computing smoothed values within each fold.
Groups with n < 10 are excluded from the smoothed map and default to the global mean.
Parameters:
train_df (pd.DataFrame): Training set.
target_col (str): Target variable (e.g. 'Survived').
group_col (str): Categorical feature to encode (e.g. 'Title_bin').
test_df (pd.DataFrame or None): Optional test set to encode.
prior (int): Smoothing strength for Bayesian mean.
feature_name (str or None): Optional name for the new feature.
n_splits (int): Number of CV folds for OOF encoding.
random_state (int): Seed for reproducibility.
Returns:
str: Name of the generated smoothed feature.
"""
if feature_name is None:
feature_name = f'global_{group_col}_smoothed'
oof_feature = pd.Series(0.0, index=train_df.index)
global_mean = train_df[target_col].mean()
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
for train_idx, val_idx in skf.split(train_df, train_df[target_col]):
fold_train = train_df.iloc[train_idx]
fold_val = train_df.iloc[val_idx]
fold_global_mean = fold_train[target_col].mean()
grouped = (
fold_train.groupby(group_col, observed=True)[target_col]
.agg(['mean', 'count'])
.rename(columns={'mean': 'group_mean', 'count': 'group_count'})
)
# Filter to exclude low-sample groups
grouped = grouped[grouped['group_count'] >= 10]
grouped['smoothed'] = (
(grouped['group_mean'] * grouped['group_count'] + prior * fold_global_mean) /
(grouped['group_count'] + prior)
)
smoothed_map = grouped['smoothed'].to_dict()
val_keys = fold_val[group_col]
smoothed_vals = val_keys.map(smoothed_map).fillna(fold_global_mean)
oof_feature.iloc[val_idx] = smoothed_vals
train_df[feature_name] = oof_feature
print(f"✅ Added feature '{feature_name}' to train_df.")
if test_df is not None:
grouped_full = (
train_df.groupby(group_col, observed=True)[target_col]
.agg(['mean', 'count'])
.rename(columns={'mean': 'group_mean', 'count': 'group_count'})
)
grouped_full = grouped_full[grouped_full['group_count'] >= 10]
grouped_full['smoothed'] = (
(grouped_full['group_mean'] * grouped_full['group_count'] + prior * global_mean) /
(grouped_full['group_count'] + prior)
)
smoothed_map_test = grouped_full['smoothed'].to_dict()
test_keys = test_df[group_col]
smoothed_vals = test_keys.map(smoothed_map_test).fillna(global_mean)
test_df[feature_name] = smoothed_vals
print(f"✅ Added feature '{feature_name}' to test_df.")
return feature_name
# For use in creating leakage-free Pclass x Sex smoothed rate features (e.g. P1_Male_Title_bin_smoothed)
def generate_subgroup_smoothed_feature(train_df, target_col, pclass_val, sex_val, group_col=None,
test_df=None, feature_name=None, prior=10, n_splits=5, random_state=42):
"""
Adds an out-of-fold smoothed target encoding feature to train_df (and optionally test_df),
for a specific Pclass × Sex subgroup, optionally grouped by another column.
If group_col is None, a single smoothed rate is applied to the subgroup.
Parameters:
train_df (pd.DataFrame): Training DataFrame.
target_col (str): Target variable (e.g., 'Survived').
pclass_val (int): Pclass value (1, 2, or 3).
sex_val (str): 'male' or 'female'.
group_col (str or None): If given, compute rates per group_col. Otherwise, single subgroup rate.
test_df (pd.DataFrame, optional): Optional test DataFrame.
feature_name (str, optional): Feature name to assign. Auto-generated if None.
prior (float): Smoothing strength.
n_splits (int): StratifiedKFold folds.
random_state (int): Seed.
Returns:
str: Name of the feature added to train_df (and test_df if given).
"""
group_label = group_col if group_col else "overall"
if feature_name is None:
feature_name = f'P{pclass_val}_{sex_val.capitalize()}_{group_label}_smoothed'
mask_train = (train_df['Pclass'] == pclass_val) & (train_df['Sex'] == sex_val)
subgroup_df = train_df[mask_train].copy()
oof_feature = pd.Series(0.0, index=train_df.index)
if group_col is None:
# Handle subgroup-wide smoothing without further grouping
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
for train_idx, val_idx in skf.split(subgroup_df, subgroup_df[target_col]):
fold_train = subgroup_df.iloc[train_idx]
fold_val = subgroup_df.iloc[val_idx]
fold_global_mean = train_df[target_col].mean()
group_mean = fold_train[target_col].mean()
group_count = len(fold_train)
smoothed_value = (
(group_mean * group_count + prior * fold_global_mean) /
(group_count + prior)
)
            oof_feature.loc[fold_val.index] = smoothed_value
else:
# Normal group_col-specific smoothing
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
for train_idx, val_idx in skf.split(subgroup_df, subgroup_df[target_col]):
fold_train = subgroup_df.iloc[train_idx]
fold_val = subgroup_df.iloc[val_idx]
fold_global_mean = fold_train[target_col].mean()
grouped = (
fold_train.groupby(group_col, observed=True)[target_col]
.agg(['mean', 'count'])
.rename(columns={'mean': 'group_mean', 'count': 'group_count'})
)
grouped = grouped[grouped['group_count'] >= 10].copy()
grouped['smoothed'] = (
(grouped['group_mean'] * grouped['group_count'] + prior * fold_global_mean) /
(grouped['group_count'] + prior)
)
smoothed_map = grouped['smoothed'].to_dict()
val_keys = fold_val[group_col]
oof_feature.loc[fold_val.index] = val_keys.map(smoothed_map).fillna(0.0)
train_df[feature_name] = oof_feature
print(f"✅ Added feature '{feature_name}' to train_df (Pclass={pclass_val}, Sex={sex_val})")
if test_df is not None:
mask_test = (test_df['Pclass'] == pclass_val) & (test_df['Sex'] == sex_val)
test_df[feature_name] = 0.0 # Default value for all
if group_col is None:
global_mean = train_df[target_col].mean()
subgroup_mean = subgroup_df[target_col].mean()
subgroup_count = len(subgroup_df)
smoothed_value = (
(subgroup_mean * subgroup_count + prior * global_mean) /
(subgroup_count + prior)
)
test_df.loc[mask_test, feature_name] = smoothed_value
else:
global_mean = subgroup_df[target_col].mean()
grouped = (
subgroup_df.groupby(group_col, observed=True)[target_col]
.agg(['mean', 'count'])
.rename(columns={'mean': 'group_mean', 'count': 'group_count'})
)
grouped = grouped[grouped['group_count'] >= 10].copy()
grouped['smoothed'] = (
(grouped['group_mean'] * grouped['group_count'] + prior * global_mean) /
(grouped['group_count'] + prior)
)
smoothed_map_test = grouped['smoothed'].to_dict()
test_keys = test_df.loc[mask_test, group_col]
test_df.loc[mask_test, feature_name] = test_keys.map(smoothed_map_test).fillna(0.0)
print(f"✅ Added feature '{feature_name}' to test_df (Pclass={pclass_val}, Sex={sex_val})")
return feature_name
Generate Global Smoothed Features¶
global_feature_list = [
"Pclass_Title_normalized",
"Pclass_Sex",
"Sex_HasCabin",
"Title_normalized",
"Sex_Embarked",
"Sex_Parch_SibSp_bin",
"Pclass_Parch_SibSp_bin",
"Pclass_Deck_bin",
"HasCabin_Parch_SibSp_bin",
"Pclass_Embarked",
"Pclass_Cabin_Location_s",
"Pclass_HasCabin",
"Embarked_HasCabin",
"Deck_bin",
"Parch_SibSp_bin",
# These will be accounted for in the next section
#"Pclass_Sex_2_female",
#"Pclass_Sex_1_female",
#"Pclass_Sex_3_male",
#"Pclass_Sex_2_male",
#"Pclass_Sex_3_female"
]
global_feature_cols = []
for col in global_feature_list:
global_feature_col = generate_global_smoothed_feature(
train_df=prepared_train_df,
target_col='Survived',
group_col=col,
test_df=prepared_test_df # optional; omit if not available
)
global_feature_cols.append(global_feature_col)
✅ Added feature 'global_Pclass_Title_normalized_smoothed' to train_df. ✅ Added feature 'global_Pclass_Title_normalized_smoothed' to test_df. ✅ Added feature 'global_Pclass_Sex_smoothed' to train_df. ✅ Added feature 'global_Pclass_Sex_smoothed' to test_df. ✅ Added feature 'global_Sex_HasCabin_smoothed' to train_df. ✅ Added feature 'global_Sex_HasCabin_smoothed' to test_df. ✅ Added feature 'global_Title_normalized_smoothed' to train_df. ✅ Added feature 'global_Title_normalized_smoothed' to test_df. ✅ Added feature 'global_Sex_Embarked_smoothed' to train_df. ✅ Added feature 'global_Sex_Embarked_smoothed' to test_df. ✅ Added feature 'global_Sex_Parch_SibSp_bin_smoothed' to train_df. ✅ Added feature 'global_Sex_Parch_SibSp_bin_smoothed' to test_df. ✅ Added feature 'global_Pclass_Parch_SibSp_bin_smoothed' to train_df. ✅ Added feature 'global_Pclass_Parch_SibSp_bin_smoothed' to test_df. ✅ Added feature 'global_Pclass_Deck_bin_smoothed' to train_df. ✅ Added feature 'global_Pclass_Deck_bin_smoothed' to test_df. ✅ Added feature 'global_HasCabin_Parch_SibSp_bin_smoothed' to train_df. ✅ Added feature 'global_HasCabin_Parch_SibSp_bin_smoothed' to test_df. ✅ Added feature 'global_Pclass_Embarked_smoothed' to train_df. ✅ Added feature 'global_Pclass_Embarked_smoothed' to test_df. ✅ Added feature 'global_Pclass_Cabin_Location_s_smoothed' to train_df. ✅ Added feature 'global_Pclass_Cabin_Location_s_smoothed' to test_df. ✅ Added feature 'global_Pclass_HasCabin_smoothed' to train_df. ✅ Added feature 'global_Pclass_HasCabin_smoothed' to test_df. ✅ Added feature 'global_Embarked_HasCabin_smoothed' to train_df. ✅ Added feature 'global_Embarked_HasCabin_smoothed' to test_df. ✅ Added feature 'global_Deck_bin_smoothed' to train_df. ✅ Added feature 'global_Deck_bin_smoothed' to test_df. ✅ Added feature 'global_Parch_SibSp_bin_smoothed' to train_df. ✅ Added feature 'global_Parch_SibSp_bin_smoothed' to test_df.
# Construct list of all smoothed feature dicts
smoothed_features_to_create = [
{ 'pclass': 1, 'sex': 'male', 'group_col': None },
{ 'pclass': 2, 'sex': 'male', 'group_col': None },
{ 'pclass': 3, 'sex': 'male', 'group_col': None },
{ 'pclass': 1, 'sex': 'female', 'group_col': None },
{ 'pclass': 2, 'sex': 'female', 'group_col': None },
{ 'pclass': 3, 'sex': 'female', 'group_col': None },
{ 'pclass': 1, 'sex': 'female', 'group_col': 'Age_Group' },
{ 'pclass': 1, 'sex': 'female', 'group_col': 'Parch_SibSp_bin' },
{ 'pclass': 1, 'sex': 'female', 'group_col': 'Deck_bin' },
{ 'pclass': 1, 'sex': 'male', 'group_col': 'Age_Group' },
{ 'pclass': 2, 'sex': 'male', 'group_col': 'Age_Group' },
{ 'pclass': 2, 'sex': 'male', 'group_col': 'Title_normalized' },
{ 'pclass': 2, 'sex': 'male', 'group_col': 'Cabin_Location_s' },
{ 'pclass': 2, 'sex': 'male', 'group_col': 'Parch_SibSp_bin' },
{ 'pclass': 2, 'sex': 'male', 'group_col': 'Deck_bin' },
{ 'pclass': 2, 'sex': 'male', 'group_col': 'HasCabin' },
{ 'pclass': 3, 'sex': 'female', 'group_col': 'Parch_SibSp_bin' },
{ 'pclass': 3, 'sex': 'female', 'group_col': 'Embarked' },
{ 'pclass': 3, 'sex': 'female', 'group_col': 'Age_Group' },
{ 'pclass': 3, 'sex': 'female', 'group_col': 'Title_normalized' },
{ 'pclass': 3, 'sex': 'male', 'group_col': 'Age_Group' },
{ 'pclass': 3, 'sex': 'male', 'group_col': 'Title_normalized' },
{ 'pclass': 3, 'sex': 'male', 'group_col': 'Deck_bin' },
{ 'pclass': 3, 'sex': 'male', 'group_col': 'Parch_SibSp_bin' },
]
smoothed_feature_cols = []
for config in smoothed_features_to_create:
smoothed_feature_col = generate_subgroup_smoothed_feature(prepared_train_df, 'Survived', config['pclass'], config['sex'], config['group_col'],
test_df=prepared_test_df)
smoothed_feature_cols.append(smoothed_feature_col)
print(f"Created {len(smoothed_feature_cols)} smoothed features.")
✅ Added feature 'P1_Male_overall_smoothed' to train_df (Pclass=1, Sex=male) ✅ Added feature 'P1_Male_overall_smoothed' to test_df (Pclass=1, Sex=male)
✅ Added feature 'P2_Male_overall_smoothed' to train_df (Pclass=2, Sex=male) ✅ Added feature 'P2_Male_overall_smoothed' to test_df (Pclass=2, Sex=male)
✅ Added feature 'P3_Male_overall_smoothed' to train_df (Pclass=3, Sex=male) ✅ Added feature 'P3_Male_overall_smoothed' to test_df (Pclass=3, Sex=male)
✅ Added feature 'P1_Female_overall_smoothed' to train_df (Pclass=1, Sex=female) ✅ Added feature 'P1_Female_overall_smoothed' to test_df (Pclass=1, Sex=female)
✅ Added feature 'P2_Female_overall_smoothed' to train_df (Pclass=2, Sex=female) ✅ Added feature 'P2_Female_overall_smoothed' to test_df (Pclass=2, Sex=female)
✅ Added feature 'P3_Female_overall_smoothed' to train_df (Pclass=3, Sex=female) ✅ Added feature 'P3_Female_overall_smoothed' to test_df (Pclass=3, Sex=female)
✅ Added feature 'P1_Female_Age_Group_smoothed' to train_df (Pclass=1, Sex=female) ✅ Added feature 'P1_Female_Age_Group_smoothed' to test_df (Pclass=1, Sex=female)
✅ Added feature 'P1_Female_Parch_SibSp_bin_smoothed' to train_df (Pclass=1, Sex=female) ✅ Added feature 'P1_Female_Parch_SibSp_bin_smoothed' to test_df (Pclass=1, Sex=female)
✅ Added feature 'P1_Female_Deck_bin_smoothed' to train_df (Pclass=1, Sex=female) ✅ Added feature 'P1_Female_Deck_bin_smoothed' to test_df (Pclass=1, Sex=female)
✅ Added feature 'P1_Male_Age_Group_smoothed' to train_df (Pclass=1, Sex=male) ✅ Added feature 'P1_Male_Age_Group_smoothed' to test_df (Pclass=1, Sex=male)
✅ Added feature 'P2_Male_Age_Group_smoothed' to train_df (Pclass=2, Sex=male) ✅ Added feature 'P2_Male_Age_Group_smoothed' to test_df (Pclass=2, Sex=male)
✅ Added feature 'P2_Male_Title_normalized_smoothed' to train_df (Pclass=2, Sex=male) ✅ Added feature 'P2_Male_Title_normalized_smoothed' to test_df (Pclass=2, Sex=male)
✅ Added feature 'P2_Male_Cabin_Location_s_smoothed' to train_df (Pclass=2, Sex=male) ✅ Added feature 'P2_Male_Cabin_Location_s_smoothed' to test_df (Pclass=2, Sex=male)
✅ Added feature 'P2_Male_Parch_SibSp_bin_smoothed' to train_df (Pclass=2, Sex=male) ✅ Added feature 'P2_Male_Parch_SibSp_bin_smoothed' to test_df (Pclass=2, Sex=male)
✅ Added feature 'P2_Male_Deck_bin_smoothed' to train_df (Pclass=2, Sex=male) ✅ Added feature 'P2_Male_Deck_bin_smoothed' to test_df (Pclass=2, Sex=male)
C:\Users\pault\anaconda3\Lib\site-packages\sklearn\model_selection\_split.py:805: UserWarning: The least populated class in y has only 3 members, which is less than n_splits=5. warnings.warn(
✅ Added feature 'P2_Male_HasCabin_smoothed' to train_df (Pclass=2, Sex=male) ✅ Added feature 'P2_Male_HasCabin_smoothed' to test_df (Pclass=2, Sex=male)
✅ Added feature 'P3_Female_Parch_SibSp_bin_smoothed' to train_df (Pclass=3, Sex=female) ✅ Added feature 'P3_Female_Parch_SibSp_bin_smoothed' to test_df (Pclass=3, Sex=female)
✅ Added feature 'P3_Female_Embarked_smoothed' to train_df (Pclass=3, Sex=female) ✅ Added feature 'P3_Female_Embarked_smoothed' to test_df (Pclass=3, Sex=female)
✅ Added feature 'P3_Female_Age_Group_smoothed' to train_df (Pclass=3, Sex=female) ✅ Added feature 'P3_Female_Age_Group_smoothed' to test_df (Pclass=3, Sex=female)
✅ Added feature 'P3_Female_Title_normalized_smoothed' to train_df (Pclass=3, Sex=female) ✅ Added feature 'P3_Female_Title_normalized_smoothed' to test_df (Pclass=3, Sex=female)
✅ Added feature 'P3_Male_Age_Group_smoothed' to train_df (Pclass=3, Sex=male) ✅ Added feature 'P3_Male_Age_Group_smoothed' to test_df (Pclass=3, Sex=male)
✅ Added feature 'P3_Male_Title_normalized_smoothed' to train_df (Pclass=3, Sex=male) ✅ Added feature 'P3_Male_Title_normalized_smoothed' to test_df (Pclass=3, Sex=male)
✅ Added feature 'P3_Male_Deck_bin_smoothed' to train_df (Pclass=3, Sex=male) ✅ Added feature 'P3_Male_Deck_bin_smoothed' to test_df (Pclass=3, Sex=male)
✅ Added feature 'P3_Male_Parch_SibSp_bin_smoothed' to train_df (Pclass=3, Sex=male) ✅ Added feature 'P3_Male_Parch_SibSp_bin_smoothed' to test_df (Pclass=3, Sex=male)
Created 24 smoothed features.
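The smoothing helpers (generate_global_smoothed_feature, generate_subgroup_smoothed_feature) are defined earlier in the notebook. As a rough sketch of the underlying idea — assuming standard additive target-encoding smoothing with a hypothetical smoothing strength m, which may differ from the notebook's actual implementation — small subgroups are pulled toward the global survival rate:

```python
import pandas as pd

def smoothed_rate(train_df, group_col, target_col='Survived', m=10.0):
    # Additive (Bayesian) smoothing: group rates with few observations are
    # pulled toward the global mean; 'm' controls the strength of the pull.
    global_mean = train_df[target_col].mean()
    stats = train_df.groupby(group_col)[target_col].agg(['mean', 'count'])
    return (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

# Example: a 2-row group with a 100% raw rate is pulled strongly toward
# the global mean (0.2 here), while the larger group moves much less.
df = pd.DataFrame({'g': ['A', 'A'] + ['B'] * 8,
                   'Survived': [1, 1] + [0] * 8})
rates = smoothed_rate(df, 'g')
```

With m=10, group A's raw rate of 1.0 shrinks to (2·1 + 10·0.2)/(2 + 10) = 1/3, which is the leakage-resistance property these features rely on.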
Is_Shared_Ticket¶
- Important Notes re: Data Leakage:
  - To prevent test set leakage, it is critical to use only the training data set when calculating how frequently a ticket is shared amongst passengers.
  - Because tickets are also shared by passengers in the test data set (and likely beyond), the frequency calculation acts more as a "weight" to aid training prediction than as a true calculation of ticket frequency.
  - A share count map is implemented here to make the ticket share counts of the training data available in the test data.
  - Matching the training data's shared-ticket counts against test data tickets revealed that the frequency of matching counts dropped significantly for Share_Ticket_Count >= 1, motivating a binary Is_Shared_Ticket feature instead, indicating whether or not a given ticket was shared by training data passengers.
- Distribution Shift of Is_Shared_Ticket:
  - Test data passengers with tickets matching training set shared tickets are sparsely distributed.
  - Will keep this in mind during feature experimentation and drop the feature from the model if it contributes to overfitting.
def create_feature_Is_Shared_Ticket(train_df, test_df):
    """
    Creates the "Is_Shared_Ticket" feature: a binary integer indicating whether a given ticket
    was shared by any other training set passenger.
    Is_Shared_Ticket is zero (False) for tickets not shared by any other passenger.
    Training data values are propagated to test data rows that share the same ticket number.
    Is_Shared_Ticket of test data ticket numbers that do not appear in the training data is set to zero.
    To prevent test set leakage, it's critical to use only the training data set when calculating
    the count of passengers sharing each ticket.
    Args:
        train_df (DataFrame): Training data set
        test_df (DataFrame): Test data set
    Returns:
        Nothing
    """
    # Count how many OTHER passengers share each ticket; subtracting one keeps
    # unshared tickets at zero so they receive no weight
    training_ticket_counts = train_df['Ticket'].value_counts() - 1
    is_shared_ticket = training_ticket_counts > 0  # boolean Series indexed by ticket number
    train_df['Is_Shared_Ticket'] = train_df['Ticket'].map(is_shared_ticket).astype(int)
    # Test tickets unseen in the training data map to NaN; fill with 0 (not shared)
    test_df['Is_Shared_Ticket'] = test_df['Ticket'].map(is_shared_ticket).fillna(0).astype(int)
create_feature_Is_Shared_Ticket(prepared_train_df, prepared_test_df)
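As a toy illustration of the mapping behavior (the ticket numbers below are hypothetical, and the logic from create_feature_Is_Shared_Ticket is inlined so the snippet stands alone): a test ticket that also appears as shared in training inherits 1, while a ticket never seen in training falls back to 0.

```python
import pandas as pd

toy_train = pd.DataFrame({'Ticket': ['A1', 'A1', 'B2', 'C3']})
toy_test = pd.DataFrame({'Ticket': ['A1', 'D4']})

# Count OTHER passengers per ticket: A1 -> 1, B2 -> 0, C3 -> 0
counts = toy_train['Ticket'].value_counts() - 1
is_shared = counts > 0
toy_train['Is_Shared_Ticket'] = toy_train['Ticket'].map(is_shared).astype(int)
# 'D4' never appears in training, so map() yields NaN and fillna(0) marks it unshared
toy_test['Is_Shared_Ticket'] = toy_test['Ticket'].map(is_shared).fillna(0).astype(int)
```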
Model Development¶
# Create functions to identify highly-correlated feature pairs to remove
def get_highly_correlated_feature_pairs(df, threshold=0.85):
"""
Returns a DataFrame of feature pairs with correlation above the specified threshold.
Args:
df (pd.DataFrame): DataFrame of features (should be numeric or dummy-encoded).
threshold (float): Minimum correlation to include in output.
Returns:
pd.DataFrame: Correlated feature pairs and their correlation coefficient.
"""
corr_matrix = df.corr().abs()
upper_triangle = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Filter for correlations above the threshold
high_corr = upper_triangle.stack().reset_index()
high_corr.columns = ['Feature A', 'Feature B', 'Correlation']
return high_corr[high_corr['Correlation'] >= threshold].sort_values(by='Correlation', ascending=False)
def identify_lower_importance_correlated_features_to_drop(corr_df, importance_dict, threshold=0.85):
"""
For each correlated feature pair, identifies the feature with lower importance and prints the decision.
Args:
corr_df (pd.DataFrame): Output from get_highly_correlated_feature_pairs().
importance_dict (dict or pd.Series): Feature importance scores (higher is better).
threshold (float): Correlation threshold to consider dropping features.
Returns:
set: Set of feature names to drop.
"""
to_drop = set()
print(f"Features with correlation ≥ {threshold} and lower importance:\n")
for _, row in corr_df.iterrows():
if row['Correlation'] >= threshold:
feat_a = row['Feature A']
feat_b = row['Feature B']
imp_a = importance_dict.get(feat_a, 0)
imp_b = importance_dict.get(feat_b, 0)
if imp_a >= imp_b:
to_drop.add(feat_b)
print(f"Drop: {feat_b:30} (Importance: {imp_b:.5f}) ⬅️ Keep: {feat_a:30} (Importance: {imp_a:.5f})")
else:
to_drop.add(feat_a)
print(f"Drop: {feat_a:30} (Importance: {imp_a:.5f}) ⬅️ Keep: {feat_b:30} (Importance: {imp_b:.5f})")
return to_drop
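A quick sanity check of the upper-triangle filtering used above, on synthetic data (the column names a, b, c are made up for illustration): a near-duplicate column is flagged against its source while an independent column is not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({'a': a,
                    'b': a + rng.normal(scale=0.01, size=200),  # near-duplicate of 'a'
                    'c': rng.normal(size=200)})                 # independent noise

# Same upper-triangle masking as get_highly_correlated_feature_pairs:
# k=1 excludes the diagonal, so each pair appears exactly once
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
pairs = upper.stack().reset_index()
pairs.columns = ['Feature A', 'Feature B', 'Correlation']
high = pairs[pairs['Correlation'] >= 0.85]
```

Only the ('a', 'b') pair should survive the 0.85 threshold; ('a', 'c') and ('b', 'c') stay near zero.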
# Create reusable functions to evaluate learning curve and feature importances (where supported) for each model
def plot_validation_curve(model, param_name, param_range, selected_features=[], drop_cols=[]):
X_all = pd.concat([prepared_train_df[selected_features], prepared_test_df[selected_features]], axis=0)
X_all_encoded = pd.get_dummies(X_all, drop_first=False)
X_train_encoded = X_all_encoded.iloc[:len(prepared_train_df)].copy()
X_test_encoded = X_all_encoded.iloc[len(prepared_train_df):].copy()
def clean_encodings_for_xgb(train_encoded_df, test_encoded_df):
for df in [train_encoded_df, test_encoded_df]:
df.columns = df.columns.str.replace(r'[<>\[\]\(\)]', '', regex=True)
df.columns = df.columns.str.replace(', ', '_', regex=False)
df.columns = df.columns.str.replace(r'[^0-9a-zA-Z_]', '_', regex=True)
clean_encodings_for_xgb(X_train_encoded, X_test_encoded)
if drop_cols:
X_train_encoded.drop(columns=drop_cols, inplace=True)
X_test_encoded.drop(columns=drop_cols, inplace=True)
y = prepared_train_df['Survived']
print("Evaluating baseline model with the following variables:")
for col in X_train_encoded.columns:
print(f"* {col}")
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train_scores, val_scores = validation_curve(
estimator=model, # e.g., DecisionTreeClassifier
X=X_train_encoded,
y=y,
param_name=param_name,
param_range=param_range,
cv=cv,
scoring='accuracy'
)
# Compute means
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
# Plot
plt.plot(param_range, train_mean, label="Training Score")
plt.plot(param_range, val_mean, label="Validation Score")
plt.xlabel(param_name)
plt.ylabel("Accuracy")
title = f"Validation Curve for {param_name}"
plt.title(title)
plt.legend()
plt.grid(True)
plt.show()
def plot_learning_curve(model, X, y, label=None):
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train_sizes, train_scores, val_scores = learning_curve(
estimator=model,
X=X,
y=y,
cv=cv,
scoring='accuracy',
n_jobs=2,
train_sizes=np.linspace(0.1, 1.0, 10)
)
# Compute mean and std
train_mean = train_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
val_mean = val_scores.mean(axis=1)
val_std = val_scores.std(axis=1)
# Plot curves
plt.plot(train_sizes, train_mean, label='Training Score', color='blue')
plt.plot(train_sizes, val_mean, label='Validation Score', color='orange')
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.2, color='blue')
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.2, color='orange')
# Get final values
final_train_acc = train_mean[-1]
final_val_acc = val_mean[-1]
delta = final_train_acc - final_val_acc
# Annotate final scores and delta with 4 decimal places
plt.text(train_sizes[-1], final_train_acc + 0.005, f"Train: {final_train_acc:.4f}", color='blue')
plt.text(train_sizes[-1], final_val_acc - 0.035, f"Val: {final_val_acc:.4f}", color='orange')
plt.text(train_sizes[-1] * 0.5, min(val_mean) - 0.06,
f"Δ (Train - Val): {delta:.4f}", fontsize=10, style='italic', color='gray')
# Labels and formatting
plt.xlabel("Training Set Size")
plt.ylabel("Accuracy")
title = f"Learning Curve for {label}" if label else "Learning Curve"
plt.title(title)
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
def plot_feature_importances(model, X):
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
# Print feature importances in a grid format
importance_df = pd.DataFrame({
'Feature': importances.index,
'Importance': importances.values
})
print("\nEstimator Feature Importances:\n")
print(importance_df.to_string(index=False))
# Plot
plt.figure(figsize=(8, 5))
ax = importances.plot(kind='bar')
plt.title(f"Feature Importances from {model.__class__.__name__}")
plt.ylabel("Relative Importance")
# Add value labels above each bar
for i, value in enumerate(importances):
plt.text(i, value + 0.001, f"{value:.3f}", ha='center', va='bottom', fontsize=9)
plt.tight_layout()
plt.show()
def analyze_mistake_overrepresentation(worst_fold_mistakes, full_df):
bool_cols = worst_fold_mistakes.select_dtypes(bool).columns
mistake_counts = worst_fold_mistakes[bool_cols].sum()
full_counts = full_df[bool_cols].sum()
full_counts = full_counts.replace(0, pd.NA)
ratio = (mistake_counts / full_counts).sort_values(ascending=False)
comparison_df = pd.DataFrame({
'Mistake_Count': mistake_counts,
'Train_Count': full_counts,
'Overrepresentation_Rate': ratio
}).dropna().sort_values(by='Overrepresentation_Rate', ascending=False)
return comparison_df
def plot_feature_cooccurrence_heatmap(worst_fold_mistakes, focus_col, exclude_cols=None):
"""
Creates a heatmap showing the number of mistakes where focus_col and each other feature are both 1.
Parameters:
- worst_fold_mistakes: DataFrame containing only the mistaken predictions
- focus_col: The binary feature to cross with all others (e.g. 'Pclass_3')
- exclude_cols: Optional list of columns to ignore (e.g. ['Actual', 'Predicted'])
"""
if exclude_cols is None:
exclude_cols = ['Actual', 'Predicted']
feature_cols = [col for col in worst_fold_mistakes.columns if col not in exclude_cols + [focus_col]]
# Filter to rows where focus_col == 1
relevant_mistakes = worst_fold_mistakes[worst_fold_mistakes[focus_col] == 1]
# Count co-occurrences
counts = {}
for col in feature_cols:
counts[col] = (relevant_mistakes[col] == 1).sum()
# Convert to DataFrame for heatmap
df_counts = pd.DataFrame.from_dict(counts, orient='index', columns=['Mistake_Count']).sort_values('Mistake_Count', ascending=False)
# Plot
plt.figure(figsize=(8, len(df_counts) * 0.4 + 1))
sns.heatmap(df_counts.T, annot=True, cmap='Reds', cbar=False, fmt='d')
plt.title(f"Mistake Counts When {focus_col} == 1")
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
# Create reusable methods to accelerate feature experimentation
def iterate_model(selected_features=[], drop_cols=[], feature_importances=False, permutation_importances=False, learning_curve=False, analyze_mistakes=True, **model_params):
X_train_selected = prepared_train_df[selected_features].copy()
X_test_selected = prepared_test_df[selected_features].copy()
y_train_full = prepared_train_df['Survived']
if drop_cols:
X_train_selected.drop(columns=drop_cols, inplace=True)
X_test_selected.drop(columns=drop_cols, inplace=True)
print(f"\nDropping these variables from model input:")
for col in drop_cols:
print(f"* {col}")
clf = XGBClassifier(**model_params)
print(f"\nEvaluating {clf.__class__.__name__} model with the following variables:")
for col in X_train_selected.columns:
print(f"* {col}")
cv_scores = []
fold_indices = []
all_preds = []
oof_preds = np.zeros_like(y_train_full)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X_train_selected, y_train_full):
X_train, X_val = X_train_selected.iloc[train_idx], X_train_selected.iloc[test_idx]
y_train, y_val = y_train_full.iloc[train_idx], y_train_full.iloc[test_idx]
clf.fit(X_train, y_train)
preds = clf.predict(X_val)
cv_scores.append(accuracy_score(y_val, preds))
fold_indices.append((train_idx, test_idx))
all_preds.append(preds)
oof_preds[test_idx] = preds
worst_fold_idx = np.argmin(cv_scores)
worst_test_idx = fold_indices[worst_fold_idx][1]
X_worst = X_train_selected.iloc[worst_test_idx].copy()
y_worst = y_train_full.iloc[worst_test_idx].copy()
preds_worst = all_preds[worst_fold_idx]
mistakes = X_worst[y_worst != preds_worst].copy()
mistakes['Actual'] = y_worst[y_worst != preds_worst]
mistakes['Predicted'] = preds_worst[y_worst != preds_worst]
print(f"\nCV Scores: {cv_scores}")
print(f"Worst Fold Index: {worst_fold_idx}")
print(f"Mean Accuracy: {np.mean(cv_scores):.4f}")
print(f"Standard Deviation: {np.std(cv_scores):.4f}")
## Retrain classifier with the full training dataset
clf.fit(X_train_selected, y_train_full)
if learning_curve:
plot_learning_curve(clf, X_train_selected, y_train_full, label=f"{len(X_train_selected.columns)} columns")
if feature_importances:
plot_feature_importances(clf, X_train_selected)
if permutation_importances:
result = permutation_importance(clf, X_train_selected, y_train_full, n_repeats=10, random_state=42, scoring='accuracy')
importances = pd.DataFrame({
'feature': X_train_selected.columns,
'importance_mean': result.importances_mean,
'importance_std': result.importances_std
}).sort_values(by='importance_mean', ascending=False)
print("\nPermutation Importances:")
print(importances)
print()
corr_df = get_highly_correlated_feature_pairs(X_train_selected, threshold=0.85)
importance_dict = dict(zip(clf.feature_names_in_, clf.feature_importances_))
features_to_drop = identify_lower_importance_correlated_features_to_drop(corr_df, importance_dict)
print(f"Found {len(features_to_drop)} feature(s) to drop.\n\n")
# Return (fitted model, filtered test_df, mistakes, filtered train_df) tuple for submission pipeline
return clf, X_test_selected, oof_preds, X_train_selected
Baseline Establishment¶
Baseline accuracy scores are established to gauge the progress and efficacy of developing models using our engineered features.
Predict Majority Class¶
The first baseline simply calculates the accuracy of a simulated prediction output that assigns every passenger to the majority target class.
- Baseline Accuracy of Predicting Majority Class: 62%
# Create a majority-class series and score it against the target values in the training data
target = prepared_train_df['Survived']
majority_class = target.mode()[0]
baseline_majority = pd.Series([majority_class] * len(target))
baseline_accuracy = accuracy_score(target, baseline_majority)
print(f"Baseline majority-class accuracy: {baseline_accuracy:.2f}")
Baseline majority-class accuracy: 0.62
Predict Simple Model¶
- This baseline trains an XGBClassifier using only one-hot encoded Pclass_Sex features to evaluate their standalone predictive power.
- A 5-fold stratified cross-validation setup is used to train and evaluate the model. Accuracy is calculated as the mean across folds, with the held-out fold used as "unseen" validation data in each iteration.
- In addition to model accuracy, we capture diagnostics to guide future iterations:
- Learning Curve: Plots the model's accuracy on the training set and the validation (unseen) set as the training set size increases. Helps quickly assess the extent to which the model is overfitting or underfitting the data.
- Feature Importances: Reports which features the model relied on the most when making decisions.
- Permutation Importances: Tests how much each feature affects model accuracy by randomly shuffling its values.
Summary Observations:
- Average Accuracy of XGBClassifier with Pclass_Sex OHE Features Only: 77.4%
- Learning Curve: Suggests slight underfitting, as training and validation accuracy remain close together while both decrease with increasing training set size
- Feature Importances:
  - Feature contributions are relatively balanced; no feature dominates the model's decision making at the expense of others.
  - Most used feature is Pclass_Sex_3_female, used in 35.4% of model splits
  - Least used feature is Pclass_Sex_1_female, used in 5.7% of model splits
- Permutation Importances:
  - Surprisingly, the Pclass_Sex_3_female feature -- the most used feature -- has the lowest Permutation Importance, with a mean of 0.07%
  - This suggests the model may be overusing this feature despite its limited true value, possibly due to group imbalances or confounded patterns
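The shuffling idea behind permutation importance can be reproduced by hand; the notebook itself uses sklearn's permutation_importance, so the following is only a minimal sketch on hypothetical toy columns ('signal' and 'noise'):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def manual_permutation_importance(model, X, y, col, n_repeats=10, seed=42):
    # Mean drop in accuracy when a single column's values are shuffled:
    # a large drop means the model genuinely relies on that column.
    rng = np.random.default_rng(seed)
    baseline = (model.predict(X) == y).mean()
    drops = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        X_perm[col] = rng.permutation(X_perm[col].to_numpy())
        drops.append(baseline - (model.predict(X_perm) == y).mean())
    return float(np.mean(drops))

# Toy data: 'signal' fully determines y, 'noise' is irrelevant
rng = np.random.default_rng(0)
X = pd.DataFrame({'signal': rng.integers(0, 2, 300),
                  'noise': rng.normal(size=300)})
y = X['signal'].to_numpy()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
imp_signal = manual_permutation_importance(tree, X, y, 'signal')
imp_noise = manual_permutation_importance(tree, X, y, 'noise')
```

Shuffling 'signal' collapses accuracy toward chance, while shuffling a column the tree never split on leaves predictions untouched, so its importance is zero.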
baseline_input_features = pclass_sex_oh_cols
baseline_drop_cols = []
baseline_model, baseline_X_test_encoded, worst_fold_mistakes, X_train_encoded = iterate_model(
baseline_input_features,
baseline_drop_cols,
feature_importances=True,
permutation_importances=True,
learning_curve=True
)
Evaluating XGBClassifier model with the following variables:
* Pclass_Sex_1_female
* Pclass_Sex_1_male
* Pclass_Sex_2_female
* Pclass_Sex_2_male
* Pclass_Sex_3_female
* Pclass_Sex_3_male
CV Scores: [0.7821229050279329, 0.7640449438202247, 0.7696629213483146, 0.7640449438202247, 0.7921348314606742]
Worst Fold Index: 1
Mean Accuracy: 0.7744
Standard Deviation: 0.0111
Estimator Feature Importances:
Feature Importance
Pclass_Sex_3_female 0.353995
Pclass_Sex_3_male 0.291663
Pclass_Sex_2_male 0.143335
Pclass_Sex_1_female 0.081116
Pclass_Sex_2_female 0.072465
Pclass_Sex_1_male 0.057427
Permutation Importances:
feature importance_mean importance_std
5 Pclass_Sex_3_male 0.168911 0.010942
3 Pclass_Sex_2_male 0.073850 0.004071
0 Pclass_Sex_1_female 0.046352 0.009629
2 Pclass_Sex_2_female 0.034007 0.010014
1 Pclass_Sex_1_male 0.031987 0.003339
4 Pclass_Sex_3_female 0.000673 0.007137
Features with correlation ≥ 0.85 and lower importance:
Found 0 feature(s) to drop.
Engineered Features Test¶
selected_feature_list = pclass_sex_oh_cols.tolist() + global_feature_cols + smoothed_feature_cols
selected_drop_cols = [
# Baseline Accuracy before dropping: 0.8013
# Ablation Testing Round 1
'global_Pclass_Title_normalized_smoothed', # 0.8013 -> 0.8070
#'global_Title_normalized_smoothed', # 0.8070 -> 0.7879 (!)
'Pclass_Sex_1_female', # 0.8070 -> 0.8070
'Pclass_Sex_1_male', # 0.8070 -> 0.8070
'Pclass_Sex_2_female', # 0.8070 -> 0.8070
'Pclass_Sex_2_male', # 0.8070 -> 0.8070
'Pclass_Sex_3_female', # 0.8070 -> 0.8070
'Pclass_Sex_3_male', # 0.8070 -> 0.8070
'P1_Female_overall_smoothed', # 0.8070 -> 0.8103
'P2_Male_overall_smoothed', # 0.8103 -> 0.8126
#'P3_Female_overall_smoothed' # 0.8126 -> 0.8070 (!)
'P1_Male_overall_smoothed', # 0.8126 -> 0.8137
'P2_Male_Deck_bin_smoothed', # 0.8137 -> 0.8137
'P2_Male_Cabin_Location_s_smoothed', # 0.8137 -> 0.8137
#'global_Pclass_Cabin_Location_s_smoothed' # 0.8137 -> 0.8092 (!)
#'global_Pclass_HasCabin_smoothed', # 0.8137 -> 0.8126 (!)
#'P3_Male_Deck_bin_smoothed' # 0.8137 -> 0.8047 (!)
'P3_Female_overall_smoothed', # 0.8137 -> 0.8148
'P3_Male_overall_smoothed', # 0.8148 -> 0.8148
#'global_Pclass_HasCabin_smoothed' # 0.8148 -> 0.8104 (!)
#'global_Pclass_Deck_bin_smoothed' # 0.8148 -> 0.8059 (!)
#'global_Sex_Parch_SibSp_bin_smoothed' # 0.8148 -> 0.8126 (!)
#'global_Sex_Embarked_smoothed' # 0.8148 -> 0.8070 (!)
#'P3_Female_Embarked_smoothed' # 0.8148 -> 0.8103 (!)
#'P3_Female_Title_normalized_smoothed' # 0.8148 -> 0.8137 (!)
#'P3_Female_Parch_SibSp_bin_smoothed' # 0.8148 -> 0.8137
#'P3_Male_Age_Group_smoothed' # 0.8148 -> 0.8092
'P3_Female_Age_Group_smoothed', # 0.8148 -> 0.8171
'P2_Male_HasCabin_smoothed', # 0.8171 -> 0.8182
'global_Pclass_Sex_smoothed', # 0.8182 -> 0.8182
'P2_Male_Age_Group_smoothed', # Acc: 0.8182 -> 0.8159 Std: 0.0182 -> 0.0088
'P3_Male_Deck_bin_smoothed', # 0.8159 -> 0.8204
#'global_Sex_HasCabin_smoothed' # 0.8204 -> 0.8137 (!)
#'P3_Male_Parch_SibSp_bin_smoothed' # 0.8204 -> 0.8182 (!)
#'P1_Female_Parch_SibSp_bin_smoothed' # 0.8204 -> 0.8193 (!)
#'P1_Female_Deck_bin_smoothed', # 0.8204 -> 0.8182 (!)
#'P1_Female_Age_Group_smoothed' # 0.8204 -> 0.8103 (!)
'P3_Male_Deck_bin_smoothed', # 0.8204 -> 0.8204
'P3_Male_Title_normalized_smoothed', # 0.8204 -> 0.8216
#'global_Deck_bin_smoothed', # 0.8216 -> 0.8171
#'global_Pclass_Embarked_smoothed' # 0.8216 -> 0.8036
#'global_Embarked_HasCabin_smoothed' # 0.8216 -> 0.8159
#'P2_Female_overall_smoothed' # 0.8216 -> 0.8193
# Ablation Testing Round 2
'P1_Female_Parch_SibSp_bin_smoothed',
'P1_Female_Deck_bin_smoothed',
# * P1_Male_Age_Group_smoothed
'P2_Male_Title_normalized_smoothed',
'P2_Male_Parch_SibSp_bin_smoothed',
'P3_Female_Parch_SibSp_bin_smoothed',
# * P3_Female_Embarked_smoothed
'P3_Female_Title_normalized_smoothed',
# * P3_Male_Age_Group_smoothed
'P3_Male_Parch_SibSp_bin_smoothed',
'global_HasCabin_Parch_SibSp_bin_smoothed',
'global_Pclass_Cabin_Location_s_smoothed',
'global_Sex_HasCabin_smoothed',
'global_Title_normalized_smoothed',
'global_Sex_Embarked_smoothed',
'global_Sex_Parch_SibSp_bin_smoothed',
'global_Pclass_Parch_SibSp_bin_smoothed',
#'global_Pclass_Deck_bin_smoothed',
'global_Pclass_Embarked_smoothed',
'global_Pclass_HasCabin_smoothed',
'global_Embarked_HasCabin_smoothed',
'global_Deck_bin_smoothed',
#'global_Parch_SibSp_bin_smoothed',
#'P2_Female_overall_smoothed',
#'P1_Female_Age_Group_smoothed',
'P1_Male_Age_Group_smoothed',
#'P3_Female_Embarked_smoothed',
#'P3_Male_Age_Group_smoothed'
]
submission_model, submission_X_test_selected, oof_preds, X_train_selected = iterate_model(selected_feature_list, selected_drop_cols,
feature_importances=True, permutation_importances=True, learning_curve=True,
max_depth=3,
min_child_weight=1,
gamma=1,
subsample=0.6,
colsample_bytree=1,
learning_rate=0.01,
n_estimators=250,
reg_alpha=0,
reg_lambda=1,
eval_metric='error'
)
Dropping these variables from model input:
* global_Pclass_Title_normalized_smoothed
* Pclass_Sex_1_female
* Pclass_Sex_1_male
* Pclass_Sex_2_female
* Pclass_Sex_2_male
* Pclass_Sex_3_female
* Pclass_Sex_3_male
* P1_Female_overall_smoothed
* P2_Male_overall_smoothed
* P1_Male_overall_smoothed
* P2_Male_Deck_bin_smoothed
* P2_Male_Cabin_Location_s_smoothed
* P3_Female_overall_smoothed
* P3_Male_overall_smoothed
* P3_Female_Age_Group_smoothed
* P2_Male_HasCabin_smoothed
* global_Pclass_Sex_smoothed
* P2_Male_Age_Group_smoothed
* P3_Male_Deck_bin_smoothed
* P3_Male_Deck_bin_smoothed
* P3_Male_Title_normalized_smoothed
* P1_Female_Parch_SibSp_bin_smoothed
* P1_Female_Deck_bin_smoothed
* P2_Male_Title_normalized_smoothed
* P2_Male_Parch_SibSp_bin_smoothed
* P3_Female_Parch_SibSp_bin_smoothed
* P3_Female_Title_normalized_smoothed
* P3_Male_Parch_SibSp_bin_smoothed
* global_HasCabin_Parch_SibSp_bin_smoothed
* global_Pclass_Cabin_Location_s_smoothed
* global_Sex_HasCabin_smoothed
* global_Title_normalized_smoothed
* global_Sex_Embarked_smoothed
* global_Sex_Parch_SibSp_bin_smoothed
* global_Pclass_Parch_SibSp_bin_smoothed
* global_Pclass_Embarked_smoothed
* global_Pclass_HasCabin_smoothed
* global_Embarked_HasCabin_smoothed
* global_Deck_bin_smoothed
* P1_Male_Age_Group_smoothed
Evaluating XGBClassifier model with the following variables:
* global_Pclass_Deck_bin_smoothed
* global_Parch_SibSp_bin_smoothed
* P2_Female_overall_smoothed
* P1_Female_Age_Group_smoothed
* P3_Female_Embarked_smoothed
* P3_Male_Age_Group_smoothed
CV Scores: [0.8156424581005587, 0.8202247191011236, 0.7808988764044944, 0.8202247191011236, 0.8202247191011236]
Worst Fold Index: 2
Mean Accuracy: 0.8114
Standard Deviation: 0.0154
Estimator Feature Importances:
Feature Importance
P3_Male_Age_Group_smoothed 0.285831
P2_Female_overall_smoothed 0.211786
P1_Female_Age_Group_smoothed 0.199908
P3_Female_Embarked_smoothed 0.128072
global_Pclass_Deck_bin_smoothed 0.093019
global_Parch_SibSp_bin_smoothed 0.081384
Permutation Importances:
feature importance_mean importance_std
2 P2_Female_overall_smoothed 0.079686 0.006696
3 P1_Female_Age_Group_smoothed 0.069809 0.006540
5 P3_Male_Age_Group_smoothed 0.045230 0.007360
4 P3_Female_Embarked_smoothed 0.030415 0.004747
0 global_Pclass_Deck_bin_smoothed 0.010550 0.003521
1 global_Parch_SibSp_bin_smoothed 0.005612 0.003367
Features with correlation ≥ 0.85 and lower importance:
Found 0 feature(s) to drop.
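The ablation drop-list above was built by hand, rerunning iterate_model after each candidate drop. The underlying loop can be sketched as greedy ablation — here with a hypothetical toy_score function standing in for mean CV accuracy; ties are treated as drops, matching the "0.8070 -> 0.8070" entries where equal-scoring features were still removed:

```python
def greedy_ablation(features, score_fn):
    # Try dropping each feature once; keep the drop when the score does not decrease.
    kept = list(features)
    best = score_fn(kept)
    for feat in list(kept):
        trial = [f for f in kept if f != feat]
        score = score_fn(trial)
        if score >= best:  # ties count as drops (simpler model at equal accuracy)
            kept, best = trial, score
    return kept, best

# Hypothetical scorer: 'signal' helps accuracy, 'noise' hurts it
def toy_score(feats):
    return 0.8 + (0.1 if 'signal' in feats else 0.0) - (0.05 if 'noise' in feats else 0.0)

kept, best = greedy_ablation(['signal', 'noise'], toy_score)
```

A single greedy pass is order-dependent, which is why the notebook ran a second ablation round over the survivors.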
Hyperparameter Tuning¶
plot_validation_curve(submission_model, 'reg_lambda', [0, 1, 5, 10], selected_feature_list, selected_drop_cols)
Evaluating baseline model with the following variables:
* global_Pclass_Deck_bin_smoothed
* global_Parch_SibSp_bin_smoothed
* P2_Female_overall_smoothed
* P1_Female_Age_Group_smoothed
* P3_Female_Embarked_smoothed
* P3_Male_Age_Group_smoothed
Out-of-Fold Prediction Mistake Analysis¶
def compute_oof_subgroup_mistakes(oof_preds, y_true, group_col, train_df):
"""
Computes number of mistakes per group based on out-of-fold predictions.
Args:
oof_preds (np.ndarray or pd.Series): Out-of-fold predicted labels (0/1).
y_true (np.ndarray or pd.Series): Ground truth labels.
group_col (str): Column name in train_df to group by (e.g., 'Pclass_Sex').
train_df (pd.DataFrame): DataFrame that includes the group_col.
Returns:
pd.DataFrame: Mistake counts and mistake rates by group.
"""
df = train_df.copy()
df['y_true'] = y_true
df['y_pred'] = oof_preds
df['mistake'] = df['y_true'] != df['y_pred']
result = (
df.groupby(group_col)
.agg(
Mistake_Count=('mistake', 'sum'),
Train_Count=(group_col, 'count')
)
.assign(Mistake_Rate=lambda d: d['Mistake_Count'] / d['Train_Count'])
.sort_values('Mistake_Count', ascending=False)
)
return result
- The following data counts the number of mistakes generated for each Pclass_Sex group found in the training set.
- We see the most mistakes were made within the 3_female group, which is the 2nd largest group represented in the training set.
prepared_train_df['Pclass_Sex'] = (
prepared_train_df['Pclass'].astype(str) + '_' + prepared_train_df['Sex'].astype(str)
)
oof_mistakes_df = compute_oof_subgroup_mistakes(
oof_preds=oof_preds,
y_true=prepared_train_df['Survived'],
group_col="Pclass_Sex",
train_df=prepared_train_df
)
display(oof_mistakes_df)
| Pclass_Sex | Mistake_Count | Train_Count | Mistake_Rate |
|---|---|---|---|
| 3_female | 48 | 144 | 0.333333 |
| 3_male | 47 | 347 | 0.135447 |
| 1_male | 44 | 122 | 0.360656 |
| 2_male | 16 | 108 | 0.148148 |
| 1_female | 7 | 94 | 0.074468 |
| 2_female | 6 | 76 | 0.078947 |
SHAP Analysis of Mistakes¶
- We use SHAP analysis to understand the magnitude and direction of each feature's contribution to the model's mistaken predictions of 3_female survival.
- The analysis revealed the model relied on features that were not intended to affect 3_female survival.
- The model used the following features to predict the survival of a 3_female passenger, all irrelevant to that subgroup: P3_Male_Age_Group_smoothed, P2_Female_overall_smoothed, P1_Female_Age_Group_smoothed
# Zoom in on 3_female mistakes
mask_3_female = (prepared_train_df['Pclass'] == 3) & (prepared_train_df['Sex'] == 'female')
X_3_female = X_train_selected[mask_3_female]
mistake_mask = (oof_preds != prepared_train_df['Survived']) & mask_3_female
X_3_female_mistakes = X_train_selected[mistake_mask]
explainer = shap.Explainer(submission_model)
X_shap_safe = X_3_female_mistakes.copy()
X_shap_safe.fillna(0.0, inplace=True)  # fill NaNs with a neutral value for SHAP
shap_values = explainer(X_shap_safe)
shap.plots.beeswarm(shap_values, show=False)
plt.title("3_female SHAP Values")
plt.show()
# Show individual explanations for the first five 3_female mistakes
for i in range(5):
    shap.plots.waterfall(shap_values[i], show=False)
    plt.title("3_female SHAP Mistake Example")
    plt.show()
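To move from individual waterfall plots to an overall view of which features drive the mistakes, the per-row SHAP values can be averaged by absolute magnitude. The helper below is a sketch (the function name is mine, not from the notebook) that operates on a plain `(n_rows, n_features)` array such as `shap_values.values`:

```python
import numpy as np

def rank_features_by_mean_abs_shap(values, feature_names):
    """Rank features by mean absolute SHAP value over a set of rows.

    `values` is an (n_rows, n_features) array (e.g. `shap_values.values`);
    returns (feature_name, score) pairs, largest contribution first.
    """
    scores = np.abs(values).mean(axis=0)          # mean |SHAP| per feature
    order = np.argsort(scores)[::-1]              # descending by magnitude
    return [(feature_names[i], float(scores[i])) for i in order]

# Illustrative 2x2 matrix: feature 'b' dominates on average
demo = np.array([[0.1, -0.9],
                 [-0.2, 0.8]])
ranking = rank_features_by_mean_abs_shap(demo, ['a', 'b'])
print(ranking)  # 'b' ranks first with mean |SHAP| 0.85
```

Applied to `shap_values.values` for the 3_female mistake rows, this would surface the out-of-class features noted above without reading each waterfall plot individually.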
Submission¶
# Confirm submission input features
selected_feature_list
['Pclass_Sex_1_female', 'Pclass_Sex_1_male', 'Pclass_Sex_2_female', 'Pclass_Sex_2_male', 'Pclass_Sex_3_female', 'Pclass_Sex_3_male', 'global_Pclass_Title_normalized_smoothed', 'global_Pclass_Sex_smoothed', 'global_Sex_HasCabin_smoothed', 'global_Title_normalized_smoothed', 'global_Sex_Embarked_smoothed', 'global_Sex_Parch_SibSp_bin_smoothed', 'global_Pclass_Parch_SibSp_bin_smoothed', 'global_Pclass_Deck_bin_smoothed', 'global_HasCabin_Parch_SibSp_bin_smoothed', 'global_Pclass_Embarked_smoothed', 'global_Pclass_Cabin_Location_s_smoothed', 'global_Pclass_HasCabin_smoothed', 'global_Embarked_HasCabin_smoothed', 'global_Deck_bin_smoothed', 'global_Parch_SibSp_bin_smoothed', 'P1_Male_overall_smoothed', 'P2_Male_overall_smoothed', 'P3_Male_overall_smoothed', 'P1_Female_overall_smoothed', 'P2_Female_overall_smoothed', 'P3_Female_overall_smoothed', 'P1_Female_Age_Group_smoothed', 'P1_Female_Parch_SibSp_bin_smoothed', 'P1_Female_Deck_bin_smoothed', 'P1_Male_Age_Group_smoothed', 'P2_Male_Age_Group_smoothed', 'P2_Male_Title_normalized_smoothed', 'P2_Male_Cabin_Location_s_smoothed', 'P2_Male_Parch_SibSp_bin_smoothed', 'P2_Male_Deck_bin_smoothed', 'P2_Male_HasCabin_smoothed', 'P3_Female_Parch_SibSp_bin_smoothed', 'P3_Female_Embarked_smoothed', 'P3_Female_Age_Group_smoothed', 'P3_Female_Title_normalized_smoothed', 'P3_Male_Age_Group_smoothed', 'P3_Male_Title_normalized_smoothed', 'P3_Male_Deck_bin_smoothed', 'P3_Male_Parch_SibSp_bin_smoothed']
selected_drop_cols
['global_Pclass_Title_normalized_smoothed', 'Pclass_Sex_1_female', 'Pclass_Sex_1_male', 'Pclass_Sex_2_female', 'Pclass_Sex_2_male', 'Pclass_Sex_3_female', 'Pclass_Sex_3_male', 'P1_Female_overall_smoothed', 'P2_Male_overall_smoothed', 'P1_Male_overall_smoothed', 'P2_Male_Deck_bin_smoothed', 'P2_Male_Cabin_Location_s_smoothed', 'P3_Female_overall_smoothed', 'P3_Male_overall_smoothed', 'P3_Female_Age_Group_smoothed', 'P2_Male_HasCabin_smoothed', 'global_Pclass_Sex_smoothed', 'P2_Male_Age_Group_smoothed', 'P3_Male_Deck_bin_smoothed', 'P3_Male_Deck_bin_smoothed', 'P3_Male_Title_normalized_smoothed', 'P1_Female_Parch_SibSp_bin_smoothed', 'P1_Female_Deck_bin_smoothed', 'P2_Male_Title_normalized_smoothed', 'P2_Male_Parch_SibSp_bin_smoothed', 'P3_Female_Parch_SibSp_bin_smoothed', 'P3_Female_Title_normalized_smoothed', 'P3_Male_Parch_SibSp_bin_smoothed', 'global_HasCabin_Parch_SibSp_bin_smoothed', 'global_Pclass_Cabin_Location_s_smoothed', 'global_Sex_HasCabin_smoothed', 'global_Title_normalized_smoothed', 'global_Sex_Embarked_smoothed', 'global_Sex_Parch_SibSp_bin_smoothed', 'global_Pclass_Parch_SibSp_bin_smoothed', 'global_Pclass_Embarked_smoothed', 'global_Pclass_HasCabin_smoothed', 'global_Embarked_HasCabin_smoothed', 'global_Deck_bin_smoothed', 'P1_Male_Age_Group_smoothed']
submission_X_test_selected.isnull().sum().loc[lambda x: x > 0]
Series([], dtype: int64)
# Confirm no unexpected columns
submission_X_test_selected.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 6 columns):
 #   Column                           Non-Null Count  Dtype
---  ------                           --------------  -----
 0   global_Pclass_Deck_bin_smoothed  418 non-null    float64
 1   global_Parch_SibSp_bin_smoothed  418 non-null    float64
 2   P2_Female_overall_smoothed       418 non-null    float64
 3   P1_Female_Age_Group_smoothed     418 non-null    float64
 4   P3_Female_Embarked_smoothed      418 non-null    float64
 5   P3_Male_Age_Group_smoothed       418 non-null    float64
dtypes: float64(6)
memory usage: 19.7 KB
# Confirm submission model configuration
submission_model
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None, colsample_bytree=1,
device=None, early_stopping_rounds=None, enable_categorical=False,
eval_metric='error', feature_types=None, feature_weights=None,
gamma=1, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=0.01, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=3, max_leaves=None,
min_child_weight=1, missing=nan, monotone_constraints=None,
multi_strategy=None, n_estimators=250, n_jobs=None,
num_parallel_tree=None, ...)
submission = pd.DataFrame({
'PassengerId': test_df['PassengerId'],
'Survived': submission_model.predict(submission_X_test_selected)
})
timestamp = datetime.now().strftime('%Y%m%d-%H%M%S')
filename = f'./submission-{timestamp}.csv'
submission.to_csv(filename, index=False)
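Before uploading, it is cheap to validate the submission frame against the format Kaggle expects for this competition: 418 rows with exactly a PassengerId column and a binary Survived column. The helper below is a minimal sketch (the function name is mine); the toy frame's IDs match the test set's known 892-1309 range:

```python
import pandas as pd

def check_submission(df, expected_rows=418):
    """Lightweight sanity checks for a Titanic submission frame."""
    assert list(df.columns) == ['PassengerId', 'Survived'], "unexpected columns"
    assert len(df) == expected_rows, f"expected {expected_rows} rows, got {len(df)}"
    assert df['Survived'].isin([0, 1]).all(), "Survived must be 0/1"
    assert df['PassengerId'].is_unique, "duplicate PassengerId values"
    return True

# Toy frame with the expected shape (all-zero predictions, for illustration)
toy = pd.DataFrame({'PassengerId': range(892, 892 + 418),
                    'Survived': [0] * 418})
print(check_submission(toy))
```

Running `check_submission(submission)` just before `to_csv` would catch shape or dtype regressions introduced by upstream feature changes.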
References¶
- (1) "Titanic Deckplans." Encyclopedia Titanica, https://www.encyclopedia-titanica.org/titanic-deckplans/
Table of Contents Generator¶
import json
import re
def slugify(text):
text = text.strip()
text = re.sub(r'[^\w\s\-]', '', text)  # keep alphanumerics, underscores, whitespace, and hyphens
return re.sub(r'[\s]+', '-', text)
def extract_headings(ipynb_path):
with open(ipynb_path, 'r', encoding='utf-8') as f:
nb = json.load(f)
toc_lines = ["## Table of Contents\n"]
for cell in nb['cells']:
if cell['cell_type'] == 'markdown':
for line in cell['source']:
match = re.match(r'^(#{2,6})\s+(.*)', line)
if match:
level = len(match.group(1)) - 1 # offset for nesting
title = match.group(2).strip()
anchor = slugify(title)
indent = ' ' * (level - 1)
toc_lines.append(f"{indent}1. [{title}](#{anchor})")
return '\n'.join(toc_lines)
# Example usage:
toc = extract_headings("titanic_ml_from_disaster__paultongyoo.ipynb")
print(toc)
## Table of Contents
1. [Table of Contents](#Table-of-Contents)
1. [Project Summary](#Project-Summary)
1. [What I Did](#What-I-Did)
1. [What I Learned](#What-I-Learned)
1. [What's Next](#Whats-Next)
1. [Introduction](#Introduction)
1. [Methodology](#Methodology)
1. [Data Understanding](#Data-Understanding)
1. [Data Dictionary](#Data-Dictionary)
1. [Variable Notes](#Variable-Notes)
1. [Descriptive Statistics](#Descriptive-Statistics)
1. [Row Samples](#Row-Samples)
1. [Data Types](#Data-Types)
1. [Missing Values Summary](#Missing-Values-Summary)
1. [Data Preparation](#Data-Preparation)
1. [Missing Value Imputation](#Missing-Value-Imputation)
1. [Embarked](#Embarked)
1. [Cabin](#Cabin)
1. [Age](#Age)
1. [Fare](#Fare)
1. [Exploratory Data Analysis](#Exploratory-Data-Analysis)
1. [Target](#Target)
1. [Individual Features x Target](#Individual-Features-x-Target)
1. [Pclass](#Pclass)
1. [Sex](#Sex)
1. [SibSp](#SibSp)
1. [Parch](#Parch)
1. [Embarked_](#Embarked_)
1. [HasCabin](#HasCabin)
1. [Cabin_count](#Cabin_count)
1. [Cabin_Location_s](#Cabin_Location_s)
1. [Deck](#Deck)
1. [Title](#Title)
1. [Age_](#Age_)
1. [Age_Group](#Age_Group)
1. [Fare_](#Fare_)
1. [Summary of Single Feature Relationship with Target](#Summary-of-Single-Feature-Relationship-with-Target)
1. [Composite Feature x Target](#Composite-Feature-x-Target)
1. [Pclass x Sex](#Pclass-x-Sex)
1. [Pclass x Title](#Pclass-x-Title)
1. [Pclass x Parch](#Pclass-x-Parch)
1. [Pclass x SibSp](#Pclass-x-SibSp)
1. [Sex x Parch](#Sex-x-Parch)
1. [Sex x SibSp](#Sex-x-SibSp)
1. [Pclass x Embarked](#Pclass-x-Embarked)
1. [Sex x Embarked](#Sex-x-Embarked)
1. [Pclass x HasCabin](#Pclass-x-HasCabin)
1. [Sex x HasCabin](#Sex-x-HasCabin)
1. [Parch x HasCabin](#Parch-x-HasCabin)
1. [SibSp x HasCabin](#SibSp-x-HasCabin)
1. [Embarked x HasCabin](#Embarked-x-HasCabin)
1. [Pclass x Cabin_count](#Pclass-x-Cabin_count)
1. [Sex x Cabin_count](#Sex-x-Cabin_count)
1. [Pclass x Cabin_Location_s](#Pclass-x-Cabin_Location_s)
1. [Sex x Cabin_Location_s](#Sex-x-Cabin_Location_s)
1. [Pclass x Deck_bin](#Pclass-x-Deck_bin)
1. [Sex x Deck_bin](#Sex-x-Deck_bin)
1. [Parch x Deck_bin](#Parch-x-Deck_bin)
1. [SibSp x Deck_bin](#SibSp-x-Deck_bin)
1. [Deck x Cabin_Location_s](#Deck-x-Cabin_Location_s)
1. [Pclass x Title_bin](#Pclass-x-Title_bin)
1. [Sex x Title_bin](#Sex-x-Title_bin)
1. [Pclass x Age_Group](#Pclass-x-Age_Group)
1. [Sex x Age_Group](#Sex-x-Age_Group)
1. [Pclass x FPP_log_bin](#Pclass-x-FPP_log_bin)
1. [Sex x FPP_log_bin](#Sex-x-FPP_log_bin)
1. [Pclass x Parch_SibSp](#Pclass-x-Parch_SibSp)
1. [Sex x Parch_SibSp](#Sex-x-Parch_SibSp)
1. [HasCabin x Parch_SibSp](#HasCabin-x-Parch_SibSp)
1. [Hi-Cardinality Features](#Hi-Cardinality-Features)
1. [Ticket](#Ticket)
1. [Feature Priority Based on EDA](#Feature-Priority-Based-on-EDA)
1. [Cross-Fold Distribution Shift Analysis](#Cross-Fold-Distribution-Shift-Analysis)
1. [Feature Engineering](#Feature-Engineering)
1. [Reduce Distribution Shift of Select Features](#Reduce-Distribution-Shift-of-Select-Features)
1. [Pclass x Age_Group](#Pclass-x-Age_Group)
1. [Pclass_HasCabin](#Pclass_HasCabin)
1. [Sex x HasCabin](#Sex-x-HasCabin)
1. [Embarked x HasCabin](#Embarked-x-HasCabin)
1. [Parch_SibSp_bin](#Parch_SibSp_bin)
1. [HasCabin x Parch_SibSp_bin](#HasCabin-x-Parch_SibSp_bin)
1. [Pclass x Parch_SibSp_bin](#Pclass-x-Parch_SibSp_bin)
1. [Sex x Parch_SibSp_bin](#Sex-x-Parch_SibSp_bin)
1. [Pclass x Embarked](#Pclass-x-Embarked)
1. [Sex x Embarked](#Sex-x-Embarked)
1. [Pclass x Deck_bin](#Pclass-x-Deck_bin)
1. [Pclass x Cabin_Location_s](#Pclass-x-Cabin_Location_s)
1. [Pclass x Normalized Title](#Pclass-x-Normalized-Title)
1. [Deck_bin](#Deck_bin)
1. [Title_normalized](#Title_normalized)
1. [Pclass_Sex One-Hot Encodings](#Pclass_Sex-One-Hot-Encodings)
1. [Survival Association Tests](#Survival-Association-Tests)
1. [Global Feature Survival Association Tests](#Global-Feature-Survival-Association-Tests)
1. [Pclass x Sex Subgroup Feature Survival Association Tests](#Pclass-x-Sex-Subgroup-Feature-Survival-Association-Tests)
1. [Survival Association Test Strategy and Results](#Survival-Association-Test-Strategy-and-Results)
1. [Smoothed Survival Rate Feature Engineering](#Smoothed-Survival-Rate-Feature-Engineering)
1. [Generate Global Smoothed Features](#Generate-Global-Smoothed-Features)
1. [Is_Shared_Ticket](#Is_Shared_Ticket)
1. [Model Development](#Model-Development)
1. [Baseline Establishment](#Baseline-Establishment)
1. [Predict Majority Class](#Predict-Majority-Class)
1. [Predict Simple Model](#Predict-Simple-Model)
1. [Engineered Features Test](#Engineered-Features-Test)
1. [Hyperparameter Tuning](#Hyperparameter-Tuning)
1. [Out-of-Fold Prediction Mistake Analysis](#Out-of-Fold-Prediction-Mistake-Analysis)
1. [SHAP Analysis of Mistakes](#SHAP-Analysis-of-Mistakes)
1. [Submission](#Submission)
1. [References](#References)
1. [Table of Contents Generator](#Table-of-Contents-Generator)
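The anchors in the generated table of contents come from slugify's two substitutions: strip punctuation (other than hyphens and underscores), then collapse whitespace runs into single hyphens. A self-contained demo of that same logic, checked against headings from this notebook:

```python
import re

def slugify(text):
    # Same two substitutions as the notebook's slugify: drop punctuation
    # other than hyphens/underscores, then turn whitespace runs into hyphens.
    text = text.strip()
    text = re.sub(r'[^\w\s\-]', '', text)
    return re.sub(r'[\s]+', '-', text)

print(slugify("Pclass x Sex"))            # Pclass-x-Sex
print(slugify("What's Next"))             # Whats-Next (apostrophe dropped)
print(slugify("Hi-Cardinality Features")) # Hi-Cardinality-Features
```

This matches the anchors seen in the output above, e.g. `#Whats-Next` and `#Hi-Cardinality-Features`.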